Python notes - strings, unicode, encodings


Encoding/escaping notes

Values in URLs, values in HTML/XML

See Escaping_and_delimiting_notes#Python


SQL encoding

tl;dr: don't do this yourself - use a database library that follows the DB-API 2 spec, which does the escaping for you when you use its marker-parameter style.
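For instance, with the standard library's sqlite3 module (which uses the qmark paramstyle; other drivers use e.g. %s), a minimal sketch - the table and values here are just illustrative:

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE people (name TEXT)')
# the driver handles the escaping; never paste values into the SQL string yourself
cur.execute('INSERT INTO people (name) VALUES (?)', (u"O'Brien",))
cur.execute('SELECT name FROM people WHERE name = ?', (u"O'Brien",))
print( cur.fetchall() )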

Other notes

  • 7-bit safe: Some interesting escaping-like codecs for .encode()/.decode() (py2 examples; for py3 see the sketch after this list):
    • MIME Quoted Printable: 'quopri_codec'
      • "delimiter = @ \ndfgd".encode('quopri_codec') == 'delimiter=20=3D=20@=20\ndfgd'
    • Base64: 'base64_codec'
      • '/var/www'.encode('base64_codec') == 'L3Zhci93d3c=\n'
    • Hex string (two-digit-per-byte): 'hex_codec'
      • '/var/www'.encode('hex_codec') == '2f7661722f777777'
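Note that in py3 these are bytes-to-bytes codecs, so you go through the codecs module rather than str.encode(). A minimal py3 sketch:

import codecs

codecs.encode(b'/var/www', 'base64_codec')        # b'L3Zhci93d3c=\n'
codecs.decode(b'L3Zhci93d3c=\n', 'base64_codec')  # b'/var/www'
codecs.encode(b'/var/www', 'hex_codec')           # b'2f7661722f777777'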


More on character codings:

  • Common character sets/encodings:
    • utf8
    • ascii (point being that it raises an error when trying to encode codepoints ≥ U+0080)
    • iso-8859-1 (a.k.a. latin_1)
    • cp1252 (windows in western europe; is latin1 plus printable characters in the 0x80-0x9F range, where latin1 only has control codes)



  • The "UnicodeEncodeError: 'ascii' codec can't encode character ..." error
often means something wanting to show a string on your console, so implicitly asking for sys.getdefaultencoding(), which is often 'ascii' by default - though may be situation-dependent (verify)
and site-wide, so is not the thing to change if you want portable code.
  • If you want console printing without errors, you want to explicitly spit out bytes. A few options (see the sketch after this list):
    • assume you're outputting to something that shows UTF8, and accept that it looks wrong otherwise
      • s.encode('utf8')
    • stick to ASCII, and accept that unicode gets shown as codepoint escapes rather than as the characters themselves
      • s.encode('unicode-escape') makes \u2222-style escapes in the string
      • repr(s) - much like the latter
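A quick py2-flavoured sketch of those options, assuming nothing about the terminal:

s = u'snowman: \u2603'
print s.encode('utf8')             # right on UTF-8 terminals, garbled on others
print s.encode('unicode-escape')   # prints: snowman: \u2603   (ASCII-safe)
print repr(s)                      # prints: u'snowman: \u2603'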



To avoid having such codec conversions throw exceptions, you can pass 'ignore' (or 'replace') as the errors argument - though you should know that this garbles or discards data, so you should not do this just 'to make errors go away.'

In particular, in multiple-and-variable-byte encodings like UTF8, many bytes may be consumed even when they do not lead to a valid character, so trying to decode data that isn't UTF8 as UTF8 should be avoided.
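For example, with a UTF8 byte string whose last character was truncated:

b = '\xd0\x92\xd0'            # UTF8 for U+0412, plus one stray lead byte
b.decode('utf8', 'ignore')    # u'\u0412'        - the problem is hidden
b.decode('utf8', 'replace')   # u'\u0412\ufffd'  - at least visibly mangled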

Unicode notes

UCS or UTF?

There are two flavours of unicode representation, chosen when building python, since 2.2 (see also PEP 261[1]).

Build options will call these UCS2 and UCS4, or narrow and wide.


Note that the standard library has good-enough handling of UTF16 surrogates that you might as well think of UCS2 as a UTF16 implementation[2], particularly if you're mostly just passing strings around, because encode() and decode() are pretty clever about UTF. Still, when you write unicode manipulation functions you will want to read up a little more.

The narrow/wide distinction was still there in early py3, but since Python 3.3 (PEP 393) strings use a flexible internal representation - all builds act wide, and sys.maxunicode is always 0x10ffff.


It seems *nix builds are typically wide.

On windows, it looks like py2 builds were often narrow (probably relating to UTF16 windows interfaces), and py3 builds are often wide. (verify)

Not sure about OSX.



For example, \U escapes for codepoints above U+FFFF will generate 2-codepoint strings (surrogate pairs) on narrow builds, 1-codepoint strings on wide builds.

To steal an example from [3]

On narrow builds:

>>> sys.maxunicode
65535
>>> a = u'\N{MAHJONG TILE GREEN DRAGON}' 
>>> a
u'\U0001f005'
>>> len(a)
2
>>> a[0], a[1]
(u'\ud83c', u'\udc05')
>>> [hex(ord(c)) for c in a.encode('utf-16be')]
['0xd8', '0x3c', '0xdc', '0x5']

On wide builds:

>>> sys.maxunicode
1114111
>>> a = u'\N{MAHJONG TILE GREEN DRAGON}' 
>>> a
u'\U0001f005'
>>> len(a)
1
>>> a[0]
u'\U0001f005'
>>> [hex(ord(c)) for c in a.encode('utf-16be')]
['0xd8', '0x3c', '0xdc', '0x5']


Also relevant is that py2's UTF8 codec is aware of surrogates, and combines pairs. That means that on both narrow and wide builds:

>>> u'\ud83c\udc05'.encode('utf8')
'\xf0\x9f\x80\x85'
>>> u'\U0001f005'.encode('utf8')    
'\xf0\x9f\x80\x85'


Python3 is a little more strict in a few places, e.g. its UTF8 handling[4]. For example, it won't allow surrogates in UTF8 accidentally:

u'\ud83c'.encode('utf8') + u'\udc05'.encode('utf8')
# on py2 it would be '\xed\xa0\xbc\xed\xb0\x85'
# on py3 it would be a UnicodeEncodeError, and b'\xed\xa0\xbc\xed\xb0\x85'.decode('u8') a UnicodeDecodeError

...but if you really must, look at the 'surrogatepass' error handler.
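A py3 sketch of that error handler:

# py3: 'surrogatepass' tells the UTF8 codec to let lone surrogates through
b = '\ud83c'.encode('utf8', 'surrogatepass') + '\udc05'.encode('utf8', 'surrogatepass')
# b == b'\xed\xa0\xbc\xed\xb0\x85', the bytes py2 would have produced
b.decode('utf8', 'surrogatepass')   # '\ud83c\udc05'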



The "unichr() arg not in range ..." error reports either

  • 0x110000 (wide builds).
referring to the unicode character cap at 0x10ffff
  • 0x10000 (narrow builds)
due to storing in 16-bit characters (UCS/UTF-16, see notes below),
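For example, on a wide build:

unichr(0x2603)      # u'\u2603', fine on any build
unichr(0x10ffff)    # u'\U0010ffff', the highest valid codepoint
unichr(0x110000)    # ValueError: unichr() arg not in range(0x110000)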


Some common unicode-related errors

UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in position something

...means you have an invalid unicode codepoint escape (\u) in a u''-type (or ur''-type) string.

It seems that for ur strings, the choice was made to still allow entering unicode codepoints, which means you cannot enter a literal '\u' (verify).


Another reason is trying to use a character beyond the 16-bit limit in a narrow-unicode build of python (this one is somewhat more obvious from the specific error's elaboration).


Unicode and the shell

tl;dr:

  • you can print unicode, there is implicit conversion
  • However, counting on implicit conversion is often a bad idea (brings in unknowns)
  • explicit conversion is easy enough


In languages that do unicode, the best way of handling unicode is:

  • use Unicode internally
  • decode what you receive (at application edge)
  • encode what you send (at application edge)


You're already used to doing this in file IO (and pipes to a degree).
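For file IO that pattern can look like the following py2 sketch (the filenames are just illustrative, and it assumes the input really is UTF8):

import codecs

f = codecs.open('notes.txt', 'r', encoding='utf8')    # decodes on read
text = f.read()                                       # unicode inside the program
f.close()
out = codecs.open('copy.txt', 'w', encoding='utf8')   # encodes on write
out.write(text)
out.close()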

When printing, you generally want either to do explicit conversion, or for something else to completely take charge of this.


In the shell you can get away with printing unicode, because the thing-that-takes-charge is implied.

It is also context-dependent, though: it depends on whether python detects that it's running in a shell. So code that happily prints unicode in a shell may fail when its output is redirected to a pipe, because you didn't set up the conversion more explicitly.


Perhaps the clearest way is to just show it:

# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' 
('UTF-8', 'UTF-8')
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' | cat
('UTF-8', None)
# echo foo | python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
(None, 'UTF-8')


Note that you can alter these defaults via the environment, in PYTHONIOENCODING

# export PYTHONIOENCODING=utf16
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
('utf16', 'utf16')
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' | cat
('utf16', 'utf16')
# echo foo | python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
('utf16', 'utf16')



Also relevant is the site-wide coding, see sys.getdefaultencoding(), but you should assume both that this is generally 'ascii' and that altering it is too fragile to rely on, because other things either assume it or change it.

Yes, in most cases it's fine, but once this is the source of your bug you will be very unhappy, exactly because it's not really under your control.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position something: ordinal not in range(128)

Read the previous section.

Most likely a bytestring containing high bytes got implicitly decoded as ascii - for example because it was mixed with a unicode string, or written to something that expects unicode.

Do the conversion explicitly. Either with each string, with a wrapper / at postponed-output time, or with something like:

import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)


Pre-Py3K

Bytestrings are of type() str, unicode ones of type() unicode.


Syntax

Basic unicode strings with codepoints are written like u'Bla \u2222'. Low codepoints can also be entered as e.g. \x1a, and will be displayed that way. This can be combined with the raw string form: ur''.


When they contain just ASCII, you can convert back and forth between unicode and str:

unicode('C')
str(u'C')

When they contain high bytes/codepoints, what the above would mean is ambiguous, so it raises an error; what you usually want there is an explicit encode() or decode() with a named encoding.
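For example, on py2 (where the implied codec is ascii):

str(u'\u2222')             # UnicodeEncodeError - implicit ascii encode
unicode('\xc3\xa9')        # UnicodeDecodeError - implicit ascii decode
u'\u2222'.encode('utf8')   # '\xe2\x88\xa2'     - explicit, so fine
'\xc3\xa9'.decode('utf8')  # u'\xe9'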


Things you should know about python unicode:

  • Prepending a u to a string literal, e.g. u'', means it is treated as a unicode string, and codepoints can be referenced like \u2222.
  • In python, plain strings (which can work as byte strings) and unicode strings are interchangeable, in that weak typing and coercion ensure strings become unicode when necessary, so you can do: 'Normal' + u' \u2222 ', which will be u'Normal \u2222 '.
  • Prepending an r to a string literal, e.g. r'', makes it a raw string, meaning backslash escaping does not apply. This means that when you want to store the two characters \1 in a string you don't have to write '\\1', but can write r'\1'. This makes particularly regular expressions a bit more readable.
  • For many types of output (including print), unicode strings should be .encode('utf8')-ed before being written.


Double encoding

If you somehow managed to get a bytestring converted to unicode codepoint-for-byte (literally, as in 0xC3 was converted to U+00C3 - which wouldn't normally happen), I've used quick hacks based on:

unichr( ord('\xc3') )   # byte value to codepoint
chr( ord(u'\xc3') )     # codepoint to byte value (only for codepoints < U+0100)

This is probably only useful to detect and resolve double encoding.
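The more common variant of this problem is double-encoded UTF8: bytes that were decoded as latin1 and then encoded as UTF8 again. A sketch of undoing that (assuming that really is what happened):

garbled = u'caf\xc3\xa9'                          # should have been u'caf\xe9'
fixed = garbled.encode('latin1').decode('utf8')   # u'caf\xe9'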


Behaviour

The gist is that python strings automatically become a unicode string type when necessary. If you want to read something in some encoding you'll need to decode it from that into this python-internal format (UCS, it seems), while to print or store them you'll often have to convert them to some encoding.

For example:

  • to read from e.g. a web page, check what its encoding is and decode accordingly, usually pagedata.decode('iso-8859-1') or .decode('utf8')
  • for pages and files you can also use streamreaders (see [5])
  • to store in a database (or print to a utf console) you can often use .encode('utf8')
  • to print, python checks the shell encoding, which may well be ascii - in which case unicode conversions often fail; python prefers correct handling over the 'eh, it was a good guess' that other systems often do. You can either set the encoding to something different, or make it ignore character data it cannot convert, using e.g. .encode('ascii','ignore')
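Put together, a py2 sketch for fetching and printing a page (the URL is just illustrative, and it assumes the page really is UTF8):

import urllib2

pagedata = urllib2.urlopen('http://example.com/').read()   # bytes
text = pagedata.decode('utf8')                              # unicode internally
print text.encode('ascii', 'ignore')                        # lossy, but never errors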


See also the available encodings. For related tutorials, see [6], [7].


HTML entities to unicode

See python snippets.


Py3K

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


See also