Unicode and passing strings

Similar to the string semantics in Python 3, Cython strictly separates byte strings and unicode strings. Above all, this means that by default there is no automatic conversion between byte strings and unicode strings (except for what Python 2 does in string operations). All encoding and decoding must pass through an explicit encoding/decoding step. To ease conversion between Python and C strings in simple cases, the module-level c_string_type and c_string_encoding directives can be used to implicitly insert these encoding/decoding steps.

Python string types in Cython code

Cython supports four Python string types: bytes, str, unicode and basestring. The bytes and unicode types are the specific types known from normal Python 2.x (named bytes and str in Python 3). Additionally, Cython also supports the bytearray type which behaves like the bytes type, except that it is mutable.

The str type is special in that it is the byte string in Python 2 and the Unicode string in Python 3 (for Cython code compiled with language level 2, i.e. the default). Meaning, it always corresponds exactly with the type that the Python runtime itself calls str. Thus, in Python 2, both bytes and str represent the byte string type, whereas in Python 3, both str and unicode represent the Python Unicode string type. The switch is made at C compile time; the Python version that is used to run Cython is not relevant.

When compiling Cython code with language level 3, the str type is identified with exactly the Unicode string type at Cython compile time, i.e. it does not identify with bytes when running in Python 2.

Note that the str type is not compatible with the unicode type in Python 2, i.e. you cannot assign a Unicode string to a variable or argument that is typed str. The attempt will result in either a compile time error (if detectable) or a TypeError exception at runtime. You should therefore be careful when you statically type a string variable in code that must be compatible with Python 2, as this Python version allows a mix of byte strings and unicode strings for data and users normally expect code to be able to work with both. Code that only targets Python 3 can safely type variables and arguments as either bytes or unicode.
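
For example (a minimal sketch, assuming the default language level 2), a function with a str-typed argument accepts different string types depending on the Python version it runs under:

    def greet(str name):
        return name + '!'

    greet('abc')   # always accepted: unprefixed literals have the str type
    greet(u'abc')  # fine in Python 3, but raises TypeError in Python 2
    greet(b'abc')  # fine in Python 2, but raises TypeError in Python 3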

The basestring type represents both the types str and unicode, i.e. all Python text string types in Python 2 and Python 3. This can be used for typing text variables that normally contain Unicode text (at least in Python 3) but must additionally accept the str type in Python 2 for backwards compatibility reasons. It is not compatible with the bytes type. Its usage should be rare in normal Cython code as the generic object type (i.e. untyped code) will normally be good enough and has the additional advantage of supporting the assignment of string subtypes. Support for the basestring type was added in Cython 0.20.

String literals

Cython understands all Python string type prefixes:

  • b'bytes' for byte strings
  • u'text' for Unicode strings
  • f'formatted {value}' for formatted Unicode string literals as defined by PEP 498 (added in Cython 0.24)

Unprefixed string literals become str objects when compiling with language level 2 and unicode objects (i.e. Python 3 str) with language level 3.

General notes about C strings

In many use cases, C strings (a.k.a. character pointers) are slow and cumbersome. For one, they usually require manual memory management in one way or another, which makes it more likely to introduce bugs into your code.

Then, Python string objects cache their length, so requesting it (e.g. to validate the bounds of index access or when concatenating two strings into one) is an efficient constant time operation. In contrast, calling strlen() to get this information from a C string takes linear time, which makes many operations on C strings rather costly.

Regarding text processing, Python has built-in support for Unicode, which C lacks completely. If you are dealing with Unicode text, you are usually better off using Python Unicode string objects than trying to work with encoded data in C strings. Cython makes this quite easy and efficient.

Generally speaking: unless you know what you are doing, avoid using C strings where possible and use Python string objects instead. The obvious exception to this is when passing them back and forth from and to external C code. Also, C++ strings remember their length as well, so they can provide a suitable alternative to Python bytes objects in some cases, e.g. when reference counting is not needed within a well defined context.

Passing byte strings

We have dummy C functions declared in a file called c_func.pyx that we are going to reuse throughout this tutorial:

    from libc.stdlib cimport malloc
    from libc.string cimport strcpy, strlen

    cdef char* hello_world = 'hello world'
    cdef Py_ssize_t n = strlen(hello_world)


    cdef char* c_call_returning_a_c_string():
        cdef char* c_string = <char *> malloc((n + 1) * sizeof(char))
        if not c_string:
            raise MemoryError()
        strcpy(c_string, hello_world)
        return c_string


    cdef void get_a_c_string(char** c_string_ptr, Py_ssize_t *length):
        c_string_ptr[0] = <char *> malloc((n + 1) * sizeof(char))
        if not c_string_ptr[0]:
            raise MemoryError()

        strcpy(c_string_ptr[0], hello_world)
        length[0] = n

We make a corresponding c_func.pxd to be able to cimport those functions:

    cdef char* c_call_returning_a_c_string()
    cdef void get_a_c_string(char** c_string, Py_ssize_t *length)

It is very easy to pass byte strings between C code and Python. When receiving a byte string from a C library, you can let Cython convert it into a Python byte string by simply assigning it to a Python variable:

    from c_func cimport c_call_returning_a_c_string

    cdef char* c_string = c_call_returning_a_c_string()
    cdef bytes py_string = c_string

A type cast to object or bytes will do the same thing:

    py_string = <bytes> c_string

This creates a Python byte string object that holds a copy of the original C string. It can be safely passed around in Python code, and will be garbage collected when the last reference to it goes out of scope. It is important to remember that null bytes in the string act as terminator character, as generally known from C. The above will therefore only work correctly for C strings that do not contain null bytes.

Besides not working for null bytes, the above is also very inefficient for long strings, since Cython has to call strlen() on the C string first to find out the length by counting the bytes up to the terminating null byte. In many cases, the user code will know the length already, e.g. because a C function returned it. In this case, it is much more efficient to tell Cython the exact number of bytes by slicing the C string. Here is an example:

    from libc.stdlib cimport free
    from c_func cimport get_a_c_string


    def main():
        cdef char* c_string = NULL
        cdef Py_ssize_t length = 0

        # get pointer and length from a C function
        get_a_c_string(&c_string, &length)

        try:
            py_bytes_string = c_string[:length]  # Performs a copy of the data
        finally:
            free(c_string)

Here, no additional byte counting is required and length bytes from the c_string will be copied into the Python bytes object, including any null bytes. Keep in mind that the slice indices are assumed to be accurate in this case and no bounds checking is done, so incorrect slice indices will lead to data corruption and crashes.

Note that the creation of the Python bytes string can fail with an exception, e.g. due to insufficient memory. If you need to free() the string after the conversion, you should wrap the assignment in a try-finally construct:

    from libc.stdlib cimport free
    from c_func cimport c_call_returning_a_c_string

    cdef bytes py_string
    cdef char* c_string = c_call_returning_a_c_string()
    try:
        py_string = c_string
    finally:
        free(c_string)

To convert the byte string back into a C char*, use the opposite assignment:

    cdef char* other_c_string = py_string  # other_c_string is a 0-terminated string.

This is a very fast operation after which other_c_string points to the byte string buffer of the Python string itself. It is tied to the lifetime of the Python string. When the Python string is garbage collected, the pointer becomes invalid. It is therefore important to keep a reference to the Python string as long as the char* is in use. Often enough, this only spans the call to a C function that receives the pointer as parameter. Special care must be taken, however, when the C function stores the pointer for later use. Apart from keeping a Python reference to the string object, no manual memory management is required.
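
In practice, this often looks like the following minimal sketch, where the bytes argument keeps the buffer alive for the duration of the C call (here simply strlen() from the C standard library):

    from libc.string cimport strlen

    def call_strlen(bytes py_string):
        cdef char* c_string = py_string   # points into the buffer of 'py_string'
        # 'py_string' stays referenced through the argument variable, so the
        # pointer remains valid while the C function below uses it.
        return strlen(c_string)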

Starting with Cython 0.20, the bytearray type is supported and coerces in the same way as the bytes type. However, when using it in a C context, special care must be taken not to grow or shrink the object buffer after converting it to a C string pointer. These modifications can change the internal buffer address, which will make the pointer invalid.
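
The following minimal sketch illustrates the pitfall (an example of what not to do):

    def broken_bytearray_usage(bytearray data):
        cdef char* ptr = data     # points into the current internal buffer
        data.append(0)            # may reallocate the buffer ...
        # ... so 'ptr' may now be dangling and must not be used any more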

Accepting strings from Python code

The other side, receiving input from Python code, may appear simple at first sight, as it only deals with objects. However, getting this right without making the API too narrow or too unsafe may not be entirely obvious.

In the case that the API only deals with byte strings, i.e. binary data or encoded text, it is best not to type the input argument as something like bytes, because that would restrict the allowed input to exactly that type and exclude both subtypes and other kinds of byte containers, e.g. bytearray objects or memory views.

Depending on how (and where) the data is being processed, it may be a good idea to instead receive a 1-dimensional memory view, e.g.

    def process_byte_data(unsigned char[:] data):
        length = data.shape[0]
        first_byte = data[0]
        slice_view = data[1:-1]
        # ...

Cython’s memory views are described in more detail in Typed Memoryviews, but the above example already shows most of the relevant functionality for 1-dimensional byte views. They allow for efficient processing of arrays and accept anything that can unpack itself into a byte buffer, without intermediate copying. The processed content can finally be returned in the memory view itself (or a slice of it), but it is often better to copy the data back into a flat and simple bytes or bytearray object, especially when only a small slice is returned. Since memoryviews do not copy the data, they would otherwise keep the entire original buffer alive. The general idea here is to be liberal with input by accepting any kind of byte buffer, but strict with output by returning a simple, well adapted object. This can simply be done as follows:

    def process_byte_data(unsigned char[:] data):
        # ... process the data, here, dummy processing.
        cdef bint return_all = (data[0] == 108)

        if return_all:
            return bytes(data)
        else:
            # example for returning a slice
            return bytes(data[5:7])

For read-only buffers, like bytes, the memoryview item type should be declared as const (see Read-only views). If the byte input is actually encoded text, and the further processing should happen at the Unicode level, then the right thing to do is to decode the input straight away. This is almost only a problem in Python 2.x, where Python code expects that it can pass a byte string (str) with encoded text into a text API. Since this usually happens in more than one place in the module’s API, a helper function is almost always the way to go, since it allows for easy adaptation of the input normalisation process later.
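
For example, a read-only variant of the process_byte_data() function above might look like this (a minimal sketch; the const qualifier on the item type is the only change needed to also accept immutable bytes objects):

    def process_readonly_byte_data(const unsigned char[:] data):
        # the read-only view accepts bytes objects as well as writable buffers
        return bytes(data[:2])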

This kind of input normalisation function will commonly look similar to the following:

    # to_unicode.pyx

    from cpython.version cimport PY_MAJOR_VERSION

    cdef unicode _text(s):
        if type(s) is unicode:
            # Fast path for most common case(s).
            return <unicode>s

        elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
            # Only accept byte strings as text input in Python 2.x, not in Py3.
            return (<bytes>s).decode('ascii')

        elif isinstance(s, unicode):
            # We know from the fast path above that 's' can only be a subtype here.
            # An evil cast to <unicode> might still work in some(!) cases,
            # depending on what the further processing does. To be safe,
            # we can always create a copy instead.
            return unicode(s)

        else:
            raise TypeError("Could not convert to unicode.")

And should then be used like this:

    from to_unicode cimport _text

    def api_func(s):
        text_input = _text(s)
        # ...

Similarly, if the further processing happens at the byte level, but Unicode string input should be accepted, then the following might work, if you are using memory views:

    # define a global name for whatever char type is used in the module
    ctypedef unsigned char char_type

    cdef char_type[:] _chars(s):
        if isinstance(s, unicode):
            # encode to the specific encoding used inside of the module
            s = (<unicode>s).encode('utf8')
        return s

In this case, you might want to additionally ensure that byte string input really uses the correct encoding, e.g. if you require pure ASCII input data, you can run over the buffer in a loop and check the highest bit of each byte. This should then also be done in the input normalisation function.
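
A minimal sketch of such a check, assuming it lives in the same module as the _chars() helper and the char_type ctypedef above:

    cdef char_type[:] _ascii_chars(s):
        cdef char_type[:] data = _chars(s)
        cdef Py_ssize_t i
        for i in range(data.shape[0]):
            if data[i] & 0x80:   # highest bit set => not plain ASCII
                raise ValueError("byte string input is not pure ASCII")
        return data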

Dealing with “const”

Many C libraries use the const modifier in their API to declare that they will not modify a string, or to require that users must not modify a string they return, for example:

    typedef const char specialChar;
    int process_string(const char* s);
    const unsigned char* look_up_cached_string(const unsigned char* key);

Cython has support for the const modifier in the language, so you can declare the above functions straight away as follows:

    cdef extern from "someheader.h":
        ctypedef const char specialChar
        int process_string(const char* s)
        const unsigned char* look_up_cached_string(const unsigned char* key)

Decoding bytes to text

The initially presented way of passing and receiving C strings is sufficient if your code only deals with binary data in the strings. When we deal with encoded text, however, it is best practice to decode the C byte strings to Python Unicode strings on reception, and to encode Python Unicode strings to C byte strings on the way out.

With a Python byte string object, you would normally just call the bytes.decode() method to decode it into a Unicode string:

    ustring = byte_string.decode('UTF-8')

Cython allows you to do the same for a C string, as long as it contains no null bytes:

    from c_func cimport c_call_returning_a_c_string

    cdef char* some_c_string = c_call_returning_a_c_string()
    ustring = some_c_string.decode('UTF-8')

And, more efficiently, for strings where the length is known:

    from c_func cimport get_a_c_string

    cdef char* c_string = NULL
    cdef Py_ssize_t length = 0

    # get pointer and length from a C function
    get_a_c_string(&c_string, &length)

    ustring = c_string[:length].decode('UTF-8')

The same should be used when the string contains null bytes, e.g. when it uses an encoding like UCS-4, where each character is encoded in four bytes, most of which tend to be 0.

Again, no bounds checking is done if slice indices are provided, so incorrect indices lead to data corruption and crashes. However, using negative indices is possible and will inject a call to strlen() in order to determine the string length. Obviously, this only works for 0-terminated strings without internal null bytes. Text encoded in UTF-8 or one of the ISO-8859 encodings is usually a good candidate. If in doubt, it’s better to pass indices that are ‘obviously’ correct than to rely on the data to be as expected.
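
For example, a negative end index can be used to strip a known terminator from a 0-terminated string, at the cost of the implicit strlen() call (a minimal sketch):

    cdef char* c_line = "some text\n"

    # the negative index makes Cython call strlen() to find the string length
    line = c_line[:-1].decode('UTF-8')   # drops the trailing newline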

It is common practice to wrap string conversions (and non-trivial type conversions in general) in dedicated functions, as this needs to be done in exactly the same way whenever receiving text from C. This could look as follows:

    from libc.stdlib cimport free

    cdef unicode tounicode(char* s):
        return s.decode('UTF-8', 'strict')

    cdef unicode tounicode_with_length(
            char* s, size_t length):
        return s[:length].decode('UTF-8', 'strict')

    cdef unicode tounicode_with_length_and_free(
            char* s, size_t length):
        try:
            return s[:length].decode('UTF-8', 'strict')
        finally:
            free(s)

Most likely, you will prefer shorter function names in your code based on the kind of string being handled. Different types of content often imply different ways of handling them on reception. To make the code more readable and to anticipate future changes, it is good practice to use separate conversion functions for different types of strings.
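
For example, a code base might pair a strict conversion helper for proper text with a more permissive one for diagnostic data (hypothetical names, a minimal sketch):

    cdef unicode to_text(char* s, size_t length):
        # proper text must decode cleanly
        return s[:length].decode('UTF-8', 'strict')

    cdef unicode to_log_text(char* s, size_t length):
        # log data may contain stray bytes; replace them instead of failing
        return s[:length].decode('UTF-8', 'replace')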

Encoding text to bytes

The reverse way, converting a Python unicode string to a C char*, is pretty efficient by itself, assuming that what you actually want is a memory managed byte string:

    py_byte_string = py_unicode_string.encode('UTF-8')
    cdef char* c_string = py_byte_string

As noted before, this takes the pointer to the byte buffer of the Python byte string. Trying to do the same without keeping a reference to the Python byte string will fail with a compile error:

    # this will not compile !
    cdef char* c_string = py_unicode_string.encode('UTF-8')

Here, the Cython compiler notices that the code takes a pointer to a temporary string result that will be garbage collected after the assignment. Later access to the invalidated pointer will read invalid memory and likely result in a segfault. Cython will therefore refuse to compile this code.

C++ strings

When wrapping a C++ library, strings will usually come in the form of the std::string class. As with C strings, Python byte strings automatically coerce from and to C++ strings:

    # distutils: language = c++

    from libcpp.string cimport string

    def get_bytes():
        py_bytes_object = b'hello world'
        cdef string s = py_bytes_object

        s.append('abc')
        py_bytes_object = s
        return py_bytes_object

The memory management situation is different than in C because the creation of a C++ string makes an independent copy of the string buffer which the string object then owns. It is therefore possible to convert temporarily created Python objects directly into C++ strings. A common way to make use of this is when encoding a Python unicode string into a C++ string:

    cdef string cpp_string = py_unicode_string.encode('UTF-8')

Note that this involves a bit of overhead because it first encodes the Unicode string into a temporarily created Python bytes object and then copies its buffer into a new C++ string.

For the other direction, efficient decoding support is available in Cython 0.17 and later:

    # distutils: language = c++

    from libcpp.string cimport string

    def get_ustrings():
        cdef string s = string(b'abcdefg')

        ustring1 = s.decode('UTF-8')
        ustring2 = s[2:-2].decode('UTF-8')
        return ustring1, ustring2

For C++ strings, decoding slices will always take the proper length of the string into account and apply Python slicing semantics (e.g. return empty strings for out-of-bounds indices).

Auto encoding and decoding

Cython 0.19 comes with two new directives: c_string_type and c_string_encoding. They can be used to change the Python string types that C/C++ strings coerce from and to. By default, they only coerce from and to the bytes type, and encoding or decoding must be done explicitly, as described above.

There are two use cases where this is inconvenient. First, if all C strings that are being processed (or the large majority) contain text, automatic encoding and decoding from and to Python unicode objects can reduce the code overhead a little. In this case, you can set the c_string_type directive in your module to unicode and the c_string_encoding to the encoding that your C code uses, for example:

    # cython: c_string_type=unicode, c_string_encoding=utf8

    cdef char* c_string = 'abcdefg'

    # implicit decoding:
    cdef object py_unicode_object = c_string

    # explicit conversion to Python bytes:
    py_bytes_object = <bytes>c_string

The second use case is when all C strings that are being processed only contain ASCII encodable characters (e.g. numbers) and you want your code to use the native legacy string type in Python 2 for them, instead of always using Unicode. In this case, you can set the string type to str:

    # cython: c_string_type=str, c_string_encoding=ascii

    cdef char* c_string = 'abcdefg'

    # implicit decoding in Py3, bytes conversion in Py2:
    cdef object py_str_object = c_string

    # explicit conversion to Python bytes:
    py_bytes_object = <bytes>c_string

    # explicit conversion to Python unicode:
    py_unicode_object = <unicode>c_string

The other direction, i.e. automatic encoding to C strings, is only supported for ASCII and the “default encoding”, which is usually UTF-8 in Python 3 and usually ASCII in Python 2. CPython handles the memory management in this case by keeping an encoded copy of the string alive together with the original unicode string. Otherwise, there would be no way to limit the lifetime of the encoded string in any sensible way, thus rendering any attempt to extract a C string pointer from it a dangerous endeavour. The following safely converts a Unicode string to ASCII (change c_string_encoding to default to use the default encoding instead):

    # cython: c_string_type=unicode, c_string_encoding=ascii

    def func():
        ustring = u'abc'
        cdef char* s = ustring
        return s[0]    # returns u'a'

(This example uses a function context in order to safely control the lifetime of the Unicode string. Global Python variables can be modified from the outside, which makes it dangerous to rely on the lifetime of their values.)

Source code encoding

When string literals appear in the code, the source code encoding is important. It determines the byte sequence that Cython will store in the C code for bytes literals, and the Unicode code points that Cython builds for unicode literals when parsing the byte encoded source file. Following PEP 263, Cython supports the explicit declaration of source file encodings. For example, putting the following comment at the top of an ISO-8859-15 (Latin-9) encoded source file (into the first or second line) is required to enable ISO-8859-15 decoding in the parser:

    # -*- coding: ISO-8859-15 -*-

When no explicit encoding declaration is provided, the source code is parsed as UTF-8 encoded text, as specified by PEP 3120. UTF-8 is a very common encoding that can represent the entire Unicode set of characters and is compatible with plain ASCII encoded text that it encodes efficiently. This makes it a very good choice for source code files which usually consist mostly of ASCII characters.

As an example, putting the following line into a UTF-8 encoded source file will print 5, as UTF-8 encodes the letter 'ö' in the two byte sequence '\xc3\xb6':

    print( len(b'abcö') )

whereas the following ISO-8859-15 encoded source file will print 4, as the encoding uses only 1 byte for this letter:

    # -*- coding: ISO-8859-15 -*-
    print( len(b'abcö') )

Note that the unicode literal u'abcö' is a correctly decoded four character Unicode string in both cases, whereas the unprefixed Python str literal 'abcö' will become a byte string in Python 2 (thus having length 4 or 5 in the examples above), and a 4 character Unicode string in Python 3. If you are not familiar with encodings, this may not appear obvious at first read. See CEP 108 for details.

As a rule of thumb, it is best to avoid unprefixed non-ASCII str literals and to use unicode string literals for all text. Cython also supports the __future__ import unicode_literals that instructs the parser to read all unprefixed str literals in a source file as unicode string literals, just like Python 3.
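
A minimal sketch of a module using that import:

    from __future__ import unicode_literals

    text = 'abcö'    # a unicode string in both Python 2 and 3, despite the missing u'' prefix
    data = b'abc'    # byte strings still need an explicit b'' prefix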

Single bytes and characters

The Python C-API uses the normal C char type to represent a byte value, but it has two special integer types for a Unicode code point value, i.e. a single Unicode character: Py_UNICODE and Py_UCS4. Cython supports the first natively; support for Py_UCS4 is new in Cython 0.15. Py_UNICODE is either defined as an unsigned 2-byte or 4-byte integer, or as wchar_t, depending on the platform. The exact type is a compile time option in the build of the CPython interpreter and extension modules inherit this definition at C compile time. The advantage of Py_UCS4 is that it is guaranteed to be large enough for any Unicode code point value, regardless of the platform. It is defined as a 32bit unsigned int or long.

In Cython, the char type behaves differently from the Py_UNICODE and Py_UCS4 types when coercing to Python objects. Similar to the behaviour of the bytes type in Python 3, the char type coerces to a Python integer value by default, so that the following prints 65 and not A:

    # -*- coding: ASCII -*-

    cdef char char_val = 'A'
    assert char_val == 65   # ASCII encoded byte value of 'A'
    print( char_val )

If you want a Python bytes string instead, you have to request it explicitly, and the following will print A (or b'A' in Python 3):

    print( <bytes>char_val )

The explicit coercion works for any C integer type. Values outside of the range of a char or unsigned char will raise an OverflowError at runtime. Coercion will also happen automatically when assigning to a typed variable, e.g.:

    cdef bytes py_byte_string
    py_byte_string = char_val

On the other hand, the Py_UNICODE and Py_UCS4 types are rarely used outside of the context of a Python unicode string, so their default behaviour is to coerce to a Python unicode object. The following will therefore print the character A, as would the same code with the Py_UNICODE type:

    cdef Py_UCS4 uchar_val = u'A'
    assert uchar_val == 65   # character point value of u'A'
    print( uchar_val )

Again, explicit casting will allow users to override this behaviour. The following will print 65:

    cdef Py_UCS4 uchar_val = u'A'
    print( <long>uchar_val )

Note that casting to a C long (or unsigned long) will work just fine, as the maximum code point value that a Unicode character can have is 1114111 (0x10FFFF). On platforms with 32bit or more, int is just as good.

Narrow Unicode builds

In narrow Unicode builds of CPython before version 3.3, i.e. builds where sys.maxunicode is 65535 (such as all Windows builds, as opposed to 1114111 in wide builds), it is still possible to use Unicode character code points that do not fit into the 16 bit wide Py_UNICODE type. For example, such a CPython build will accept the unicode literal u'\U00012345'. However, the underlying system level encoding leaks into Python space in this case, so that the length of this literal becomes 2 instead of 1. This also shows when iterating over it or when indexing into it. The visible substrings are u'\uD808' and u'\uDF45' in this example. They form a so-called surrogate pair that represents the above character.

For more information on this topic, it is worth reading the Wikipedia article about the UTF-16 encoding.

The same properties apply to Cython code that gets compiled for a narrow CPython runtime environment. In most cases, e.g. when searching for a substring, this difference can be ignored as both the text and the substring will contain the surrogates. So most Unicode processing code will work correctly also on narrow builds. Encoding, decoding and printing will work as expected, so that the above literal turns into exactly the same byte sequence on both narrow and wide Unicode platforms.

However, programmers should be aware that a single Py_UNICODE value (or single ‘character’ unicode string in CPython) may not be enough to represent a complete Unicode character on narrow platforms. For example, if an independent search for u'\uD808' and u'\uDF45' in a unicode string succeeds, this does not necessarily mean that the character u'\U00012345' is part of that string. It may well be that two different characters are in the string that just happen to share a code unit with the surrogate pair of the character in question. Looking for substrings works correctly because the two code units in the surrogate pair use distinct value ranges, so the pair is always identifiable in a sequence of code points.

As of version 0.15, Cython has extended support for surrogate pairs so that you can safely use an in test to search character values from the full Py_UCS4 range even on narrow platforms:

    cdef Py_UCS4 uchar = 0x12345
    print( uchar in some_unicode_string )

Similarly, it can coerce a one character string with a high Unicode code point value to a Py_UCS4 value on both narrow and wide Unicode platforms:

    cdef Py_UCS4 uchar = u'\U00012345'
    assert uchar == 0x12345

In CPython 3.3 and later, the Py_UNICODE type is an alias for the system specific wchar_t type and is no longer tied to the internal representation of the Unicode string. Instead, any Unicode character can be represented on all platforms without resorting to surrogate pairs. This implies that narrow builds no longer exist from that version on, regardless of the size of Py_UNICODE. See PEP 393 for details.

Cython 0.16 and later handles this change internally and does the right thing also for single character values as long as either type inference is applied to untyped variables or the portable Py_UCS4 type is explicitly used in the source code instead of the platform specific Py_UNICODE type. Optimisations that Cython applies to the Python unicode type will automatically adapt to PEP 393 at C compile time, as usual.

Iteration

Cython 0.13 supports efficient iteration over char*, bytes and unicode strings, as long as the loop variable is appropriately typed. So the following will generate the expected C code:

    cdef char* c_string = "Hello to A C-string's world"

    cdef char c
    for c in c_string[:11]:
        if c == 'A':
            print("Found the letter A")

The same applies to bytes objects:

    cdef bytes bytes_string = b"hello to A bytes' world"

    cdef char c
    for c in bytes_string:
        if c == 'A':
            print("Found the letter A")

For unicode objects, Cython will automatically infer the type of the loop variable as Py_UCS4:

    cdef unicode ustring = u'Hello world'

    # NOTE: no typing required for 'uchar' !
    for uchar in ustring:
        if uchar == u'A':
            print("Found the letter A")

The automatic type inference usually leads to much more efficient code here. However, note that some unicode operations still require the value to be a Python object, so Cython may end up generating redundant conversion code for the loop variable value inside of the loop. If this leads to a performance degradation for a specific piece of code, you can either type the loop variable as a Python object explicitly, or assign its value to a Python typed variable somewhere inside of the loop to enforce one-time coercion before running Python operations on it.
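
As a minimal sketch (hypothetical example), typing the loop variable as a Python object keeps the coercion in a single place when the loop body mostly performs Python operations:

    def count_characters(unicode ustring):
        counts = {}
        # a Python typed loop variable avoids repeated Py_UCS4 -> unicode
        # coercions, since the dict operations need Python objects anyway
        cdef object uchar
        for uchar in ustring:
            counts[uchar] = counts.get(uchar, 0) + 1
        return counts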

There are also optimisations for in tests, so that the following code will run in plain C code (actually using a switch statement):

    cpdef void is_in(Py_UCS4 uchar_val):
        if uchar_val in u'abcABCxY':
            print("The character is in the string.")
        else:
            print("The character is not in the string")

Combined with the looping optimisation above, this can result in very efficient character switching code, e.g. in unicode parsers.

Windows and wide character APIs

Windows system APIs natively support Unicode in the form of zero-terminated UTF-16 encoded wchar_t* strings, so called “wide strings”.

By default, Windows builds of CPython define Py_UNICODE as a synonym for wchar_t. This makes internal unicode representation compatible with UTF-16 and allows for efficient zero-copy conversions. This also means that Windows builds are always Narrow Unicode builds with all the caveats.

To aid interoperation with Windows APIs, Cython 0.19 supports wide strings (in the form of Py_UNICODE) and implicitly converts them to and from unicode string objects. These conversions behave the same way as they do for char and bytes as described in Passing byte strings.

In addition to automatic conversion, unicode literals that appear in C context become C-level wide string literals, and the len() built-in function is specialized to compute the length of a zero-terminated Py_UNICODE* string or array.
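
A minimal sketch of these two features (no Windows API involved; assuming a build where Py_UNICODE strings are supported):

    def wide_string_length():
        cdef Py_UNICODE* wide_string = u"Hello Cython"   # becomes a C-level wide string literal
        return len(wide_string)    # length of the zero-terminated wide string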

Here is an example of how one would call a Unicode API on Windows:

    import sys

    cdef extern from "Windows.h":

        ctypedef Py_UNICODE WCHAR
        ctypedef const WCHAR* LPCWSTR
        ctypedef void* HWND

        int MessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption, int uType)

    title = u"Windows Interop Demo - Python %d.%d.%d" % sys.version_info[:3]
    MessageBoxW(NULL, u"Hello Cython \u263a", title, 0)

Warning

The use of Py_UNICODE* strings outside of Windows is strongly discouraged. Py_UNICODE is inherently not portable between different platforms and Python versions.

CPython 3.3 has moved to a flexible internal representation of unicode strings (PEP 393), making all Py_UNICODE related APIs deprecated and inefficient.

One consequence of the CPython 3.3 changes is that len() of unicode strings is always measured in code points (“characters”), while Windows APIs expect the number of UTF-16 code units (where each surrogate is counted individually). To always get the number of code units, call PyUnicode_GetSize() directly.