Python notes - strings, unicode, encodings

Various related things have their own pages; see Category:Python.

Encoding/escaping notes

Values in HTML/XML nodes and attributes

The restrictions on valid characters vary a little with individual standards, but the below covers most:


Escaping within text nodes means using entities for <, >, and &, so for example:

cgi.escape("This & That <3") == 'This &amp; That &lt;3'


Text in attributes (assuming you already added the " around them)

cgi.escape(s).replace('"', '&#x22;')

(you could also use &quot; much of the time; it is defined in at least HTML 2, HTML 3.2, HTML 4, XHTML 1.0, and XML 1.0, but not in everything(verify))


I like to have helper functions named escape.nodetext() and escape.attr(), so that code is easily skim-verified to do the right thing.
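A minimal sketch of such helpers, written for Python 3 (cgi.escape was removed in Python 3.8; html.escape is the modern equivalent). The function names just follow the convention mentioned above; they are not from any library.

```python
import html

def nodetext(s):
    # For text nodes: escape &, <, and > only.
    return html.escape(s, quote=False)

def attr(s):
    # For attribute values: additionally escape quote characters.
    return html.escape(s, quote=True)

assert nodetext('This & That <3') == 'This &amp; That &lt;3'
assert attr('say "hi"') == 'say &quot;hi&quot;'
```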


Unicode: the above does not handle non-ASCII characters, because it assumes you (or your framework) apply a byte encoding (and handle the headers) for the document as a whole. That often makes more sense anyway.


percent escaping

Centers around urllib.quote() (or quote_plus(), which differs only in that it encodes spaces as + instead of %20, which is useful for POST data; see e.g. RFC 1866)


percent-escaping bytestrings

To escape everything, including /, like JS's encodeURIComponent() (useful to make values safe no matter what they are):

urllib.quote(bytestr, '')

To leave ':', '/', ';', and '?' unescaped, imitating JS's encodeURI (so the result can still be a URL):

urllib.quote(bytestr, ':/;?')
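For reference, the Python 3 spelling of the same calls (urllib.quote moved to urllib.parse.quote; note that its default is safe='/', so '/' survives unless you say otherwise):

```python
from urllib.parse import quote

# Escape everything, like JS encodeURIComponent():
assert quote('/var/www here', safe='') == '%2Fvar%2Fwww%20here'

# Leave URL-structure characters alone, like JS encodeURI:
assert quote('http://a/b?c d', safe=':/;?') == 'http://a/b?c%20d'
```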


percent-escaping unicode strings

The above, with an .encode('utf8') first:

urllib.quote(ustr.encode('utf8'), '')
urllib.quote(ustr.encode('utf8'), ':/;?')

Note that urllib.quote itself does not handle unicode: before 2.4.2 it would pass it through (which was potentially confusing), since 2.4.2 it raises KeyError.


percent-escaping dicts or sequences(verify) into an &'d result is easiest via urllib.urlencode()
urllib.urlencode( {'v1':'Q3=32', 'v2':'===='})      ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
urllib.urlencode( [('v1','Q3=32') ,('v2','====')])  ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
note: this is mostly just the wrapping you would write yourself (around quote_plus(), not quote(), but details...)
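The same calls in Python 3, where urlencode also moved into urllib.parse:

```python
from urllib.parse import urlencode

# Dicts and sequences of pairs both work, as in Python 2:
assert urlencode({'v1': 'Q3=32'}) == 'v1=Q3%3D32'
assert urlencode([('v1', 'Q3=32'), ('v2', '====')]) == 'v1=Q3%3D32&v2=%3D%3D%3D%3D'
```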


  • values in POST body
    application/x-www-form-urlencoded
    style
is actually the 'parts of a dict/list &'d toether' case, i.e. urllib.urlencode()

SQL encoding

tl;dr: don't do this yourself; use the DB-API 2.0 marker/parameter functionality (see PEP 249)
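As one concrete example, sqlite3 is a DB-API 2.0 module (its paramstyle is '?'); the driver does the quoting, so hostile values stay values. Other drivers use different markers, e.g. %s, but the idea is the same:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (name TEXT)')

# The '?' marker means the value is passed separately, never
# spliced into the SQL text, so no escaping is needed or possible.
conn.execute('INSERT INTO t (name) VALUES (?)', ("O'Brien; DROP TABLE t;",))

row = conn.execute('SELECT name FROM t').fetchone()
assert row[0] == "O'Brien; DROP TABLE t;"
```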


Other notes

  • 7-bit safe: some interesting escaping-like codecs for .encode()/.decode():
    • MIME Quoted Printable: 'quopri_codec'
      • "delimiter = @ \ndfgd".encode('quopri_codec') == 'delimiter=20=3D=20@=20\ndfgd'
    • Hex string (two-digit-per-byte): 'hex_codec'
      • '/var/www'.encode('hex_codec') == '2f7661722f777777'
    • Base64: 'base64_codec'
      • '/var/www'.encode('base64_codec') == 'L3Zhci93d3c=\n'
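Those codec names are Python 2 usage; in Python 3, str.encode() only does text-to-bytes codecs, so the usual spelling is via the binascii and base64 modules (or codecs.encode() on bytes):

```python
import base64
import binascii

# Python 3 equivalents of the hex_codec and base64_codec examples:
assert binascii.hexlify(b'/var/www') == b'2f7661722f777777'
assert base64.b64encode(b'/var/www') == b'L3Zhci93d3c='
```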


More on character codings:

  • Common character sets/encodings:
    • utf8,
    • ascii (point being that it raises error when trying to encode ≥U+80)
    • iso-8859-1 (a.k.a. latin_1)
    • cp1252 (windows in western europe; is latin1 plus some characters in a range latin1 does not define)



  • The "UnicodeEncodeError: 'ascii' codec can't encode character ..." error
    • comes from an implicit call to sys.getdefaultencoding()
    • which is often 'ascii' by default
    • but may be situation-dependent(verify) and is site-wide, so it is not the thing to change if you want portable code
  • If you want console printing without errors, you want to explicitly spit out bytes. A few options:
    • assume you're outputting to something that shows UTF8, accept that it looks wrong otherwise
      • s.encode('utf8')
    • stick to ASCII, accept that unicode gets mentioned as codepoints instead of shown as the characters in there
      • s.encode('unicode-escape') (makes \u2222-style escapes in the string)
      • repr(s) - much like the latter
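In Python 3 terms (where Python 2's repr() behaviour lives in ascii()), those options look like:

```python
s = u'snowman \u2603'

# Output real UTF-8 bytes, hoping the terminal shows them:
assert s.encode('utf8') == b'snowman \xe2\x98\x83'

# Stick to ASCII: codepoints become \uXXXX escapes:
assert s.encode('unicode-escape') == b'snowman \\u2603'

# ascii() is the Python 3 spelling of the old repr() behaviour:
assert ascii(s) == "'snowman \\u2603'"
```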



To avoid having such codec conversions throw exceptions, you can add 'ignore' as a second parameter - though you should know that this means you will garble the data, so you should not do this just 'to make errors go away.'

In particular, in multiple-and-variable-byte encodings like UTF8 you may see many bytes being consumed even if they do not lead to a valid character. Decoding data that isn't UTF8 as UTF8 should be avoided.
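The errors argument in action (Python 3 syntax); note that 'replace' at least leaves a visible marker where data was garbled, where 'ignore' drops it silently:

```python
bad = b'caf\xe9'  # latin-1 bytes, not valid UTF-8

# 'ignore' silently drops the undecodable byte:
assert bad.decode('utf8', 'ignore') == 'caf'

# 'replace' substitutes U+FFFD, the replacement character:
assert bad.decode('utf8', 'replace') == 'caf\ufffd'
```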

Unicode notes

Unicode and UTF

The "unichr() arg not in range ..." error reports either 0x10000 (narrow builds) or 0x110000 (wide builds).

The latter refers to the fact that the unicode character cap is (currently) at 0x10ffff - trying to create codepoints higher than that makes no sense. The former is more python-specific:


There are two flavours of unicode representation, chosen when building python. One is called narrow, apparently referring to UCS2, the other wide, apparently referring to UCS4.

I am not sure about this; I have seen mention of UTF-16 (and the use of surrogate pairs) and UTF-32, but narrow unicode builds do not support codepoints above 0xffff, as the unichr error illustrates.

Note that the windows (and mac?(verify)) build is narrow by default, while unices seem to default to be wide. You can build them differently.
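You can check which flavour you are running with sys.maxunicode:

```python
import sys

# 0xffff on narrow builds, 0x10ffff on wide builds; since
# Python 3.3 (PEP 393, flexible string storage) the distinction
# is gone and it is always 0x10ffff.
assert sys.maxunicode in (0xffff, 0x10ffff)
```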



Some common unicode-related errors

UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in position something

...means you have an invalid unicode codepoint (\u) in an u""-type (or ur"")-type string.

It seems that for ur'' strings, the choice was made to allow entering unicode codepoints, disallowing entering a literal '\u'(verify).


Another reason is trying to use a character beyond the 16-bit limit in a narrow-unicode build of python. (This one is somewhat more obvious from the specific error's elaboration.)


Unicode and the shell

tl;dr:

  • you can print unicode; there is implicit conversion
  • ...but never count on implicit conversion
  • explicit conversion is easy enough


In languages that do unicode, the best way of handling unicode is:

  • use Unicode internally
  • decode what you receive (at application edge)
  • encode what you send (at application edge)

You're already used to doing this in file IO and pipes.

When printing, you generally want either to do explicit conversion, or for something else to completely take charge of this.


In the shell you can get away with printing unicode, because the thing-that-takes-charge is implied.

But this is also context-dependent: it depends on whether python detects that it's running in a shell. So code that prints unicode to the shell may fail when redirected to a pipe, because you didn't set up the conversion more explicitly.


Perhaps the clearest is to show:

# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' 
('UTF-8', 'UTF-8')
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' | cat
('UTF-8', None)
# echo foo | python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
(None, 'UTF-8')


Note that you can alter these defaults via the PYTHONIOENCODING environment variable:

# export PYTHONIOENCODING=utf16
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' 
('utf16', 'utf16')
# python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)' | cat
('utf16', 'utf16') 
# echo foo | python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
('utf16', 'utf16')



Also relevant is the site encoding, see sys.getdefaultencoding(), but you should assume both that this is generally 'ascii' and that altering it is too fragile, because other things either assume that value or change it.

Yes, in most cases it's fine, but once this is the source of your bug you will be very unhappy, exactly because it's not really under your control.


UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position something: ordinal not in range(128)

Read the previous section.

Most likely you printed a unicode string, and there is no implied encoding.

Do the conversion explicitly. Either with each string, with a wrapper / at postponed-output time, or with something like:

sys.stdout = codecs.getwriter('utf8')(sys.stdout)
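The same idea in Python 3 terms: wrap a byte stream in an io.TextIOWrapper with an explicit encoding, so every write is encoded the way you chose rather than the way the shell implied. (A BytesIO stands in for the pipe/file here, just for illustration.)

```python
import io

raw = io.BytesIO()  # stands in for sys.stdout.buffer, a pipe, a file...
out = io.TextIOWrapper(raw, encoding='utf8')

out.write(u'caf\u00e9\n')
out.flush()

# The wrapper encoded the text explicitly as UTF-8:
assert raw.getvalue() == b'caf\xc3\xa9\n'
```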


Pre-Py3K

Bytestrings are of type() str, unicode ones of type() unicode.


Syntax

Basic unicode strings with codepoints are written like u'Bla \u2222'. Low codepoints can additionally be entered and will be displayed as e.g. \x1a. Can be combined with raw/regexp form: ur.


When they contain just ASCII, you can convert back and forth between unicode and str:

unicode('C')
str(u'C')

When they contain high bytes/codepoints, it is ambiguous what the above would mean; what you usually want instead is an explicit encode() or decode().
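With high codepoints, the unambiguous spelling is an explicit round trip where you name the encoding (shown here in syntax that also runs on Python 3):

```python
u = u'caf\u00e9'
b = u.encode('utf8')           # unicode -> bytes
assert b == b'caf\xc3\xa9'
assert b.decode('utf8') == u   # bytes -> unicode
```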


Things you should know about python unicode:

  • Prepending a u to a string literal, e.g. u'', means it is treated as a unicode string, and codepoints are referenced like \u2222.
  • In python, strings without unicode (which can work as byte strings) and unicode strings are interchangeable. That is, weak typing and semantics will ensure strings will become unicode when necessary, so you can do: 'Normal' + u' \u2222 ', which will be u'Normal \u2222 '.
  • Prepending an r to a string literal, e.g. r'', makes it a raw string, meaning backslash escapes do not apply. This means that when you want to store e.g. the two characters \1 in a string you don't have to write '\\1'; you can write r'\1'. This makes regular expressions in particular a bit more readable.
  • For many types of output (including print), unicode strings should be .encode('utf8')d before being written.


Double encoding

If you somehow managed to get a bytestring as unicode codepoint-for-byte (literally, as in 0xC3 was converted to U+C3, which wouldn't normally happen), I've used quick hacks based on:

unichr( ord('\xc3') )
chr( ord(u'\xc3') )

This is probably only useful to detect and resolve double encoding.
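A batch version of the same trick: latin-1 maps codepoints U+00..U+FF back to bytes 0x00..0xFF one-to-one, so encoding via it recovers the original bytes, which you can then decode properly:

```python
# 'café' in UTF-8, mistakenly decoded codepoint-for-byte:
mangled = u'caf\u00c3\u00a9'

# latin-1 round-trips codepoints <= U+FF straight back to bytes,
# after which a proper UTF-8 decode recovers the intended text.
fixed = mangled.encode('latin-1').decode('utf8')
assert fixed == u'caf\u00e9'
```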


Behaviour

The gist is that python strings automatically become a unicode string type when necessary. If you want to read something in some encoding you'll need to decode it from that into this python-internal format (UCS, it seems), while to print or store them you'll often have to convert them to some encoding.

For example:

  • to read from e.g. a web page, check what its encoding is and decode it, usually pagedata.decode('iso-8859-1') or .decode('utf8').
  • For pages and files you can also use streamreaders (see [1]).
  • to store in a database (or print to a utf console) you can often use .encode('utf8')
  • to print, python checks the shell encoding, which may well be ascii, in which case unicode conversions often fail; python prefers correct handling over the 'eh, it was a good guess' behaviour other systems often show. You can either set the encoding to something different, or make it ignore character data it cannot convert, using e.g. .encode('ascii', 'ignore')


See also the available encodings. For related tutorials, see [2], [3].


HTML entities to unicode

See python snippets.


Py3K

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


See also