Python notes - strings, unicode, encodings


Encoding/escaping notes

Values in HTML/XML nodes and attributes

The restrictions on valid characters vary a little with individual standards, but the below covers most:


Text within text nodes means using entities for <, >, and &, so for example:

cgi.escape("This & That <3") == 'This &amp; That &lt;3'


Text in attributes (assuming you already added the " around them)

cgi.escape(s).replace('"','&#x22;')

(you could also use &quot; much of the time; it is defined in at least HTML 3.2, HTML 2, HTML 4, XHTML 1.0, and XML 1.0, but not in everything (verify))


I like to have helper functions named escape.nodetext() and escape.attr(), so that code is easily skim-verified to do the right thing.
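
A minimal sketch of such helpers, written for Python 3, where cgi.escape() has been removed and html.escape(s, quote=False) does the same job. The functions are shown at module level here; putting them in a module called escape is left to taste.

```python
from html import escape as _html_escape  # Python 3 stand-in for cgi.escape

def nodetext(s):
    """Escape a string for use inside an HTML/XML text node."""
    return _html_escape(s, quote=False)

def attr(s):
    """Escape a string for use inside a double-quoted attribute value."""
    # Same as nodetext(), plus the double quote that would end the attribute
    return _html_escape(s, quote=False).replace('"', '&#x22;')

print(nodetext("This & That <3"))
print('<a title="%s">link</a>' % attr('say "hi" & bye'))
```

The caller still adds the surrounding " itself, as assumed above.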


Unicode: The above does not deal with unicode strings specially, because it assumes you (or your framework) are applying a byte encoding (and handling the headers) to the document as a whole. This often seems to make more sense.


percent escaping

Centers around urllib.quote() (or quote_plus(), which differs mainly in that it codes spaces as + instead of %20, and in that it also escapes / by default - useful for POST data, see e.g. RFC 1866)


percent-escaping bytestrings
Escape everything other than letters, digits and _.- (so including /), much like JS's encodeURIComponent(). Useful to make values safe no matter what they are:
urllib.quote(bytestr,'')
Don't escape ':', '/', ';', and '?' (imitation of JS's encodeURI, so the result can still be a URL):
urllib.quote(bytestr,':/;?')


percent-escaping unicode strings
The above with an encode('u8') first
urllib.quote(ustr.encode('utf8'),'') 
urllib.quote(ustr.encode('utf8'),':/;?')
Note that urllib.quote does not handle Unicode. Before 2.4.2 it would pass it through (which was potentially confusing), since 2.4.2 it gives KeyErrors.
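
For comparison, a small sketch of the same calls in Python 3, where these functions live in urllib.parse and quote() accepts both bytes and str (a str is encoded as UTF-8 by default). The example strings are made up.

```python
from urllib.parse import quote, quote_plus

ustr = 'caf\u00e9/menu item'

# Escape everything, including '/' (encodeURIComponent-like)
from_bytes = quote(ustr.encode('utf8'), safe='')
from_str = quote(ustr, safe='')          # same result, encoding done for you

# Leave ':/;?' alone so the result can still be a URL (encodeURI-like)
url = quote('http://example.com/a b', safe=':/;?')

# quote_plus: spaces become '+', and '/' is escaped by default
plus = quote_plus('a b/c')

print(from_bytes, from_str, url, plus)
```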


percent-escaping dicts or sequences(verify) into an &'d result is easiest via urllib.urlencode()
urllib.urlencode( {'v1':'Q3=32', 'v2':'===='})      ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
urllib.urlencode( [('v1','Q3=32') ,('v2','====')])  ==  'v1=Q3%3D32&v2=%3D%3D%3D%3D'
Note: this is mostly just the wrapping you would write yourself (around quote_plus() rather than quote(), but details...)


  • values in a POST body (application/x-www-form-urlencoded style) are actually the 'parts of a dict/list &'d together' case, i.e. urllib.urlencode()
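
To illustrate the 'wrapping around quote_plus()' remark, here is a sketch in Python 3 (urllib.parse). The helper my_urlencode is a made-up name, and it skips details the real urlencode() handles, such as non-string values.

```python
from urllib.parse import urlencode, quote_plus

params = [('v1', 'Q3=32'), ('v2', '====')]

def my_urlencode(pairs):
    """Hypothetical hand-rolled version: quote_plus each part, join with = and &."""
    return '&'.join('%s=%s' % (quote_plus(k), quote_plus(v)) for k, v in pairs)

body = urlencode(params)   # what you would send as a form-urlencoded POST body
print(body)
print(my_urlencode(params) == body)
```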

SQL encoding

tl;dr: don't do this yourself, use the DB-API2 marker/parameter functionality (see PEP 249)
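
A minimal sketch of what that looks like, using the standard library's sqlite3 module in Python 3. sqlite3 uses the ? marker style; other DB-API2 modules may use %s or named markers (check the module's paramstyle). The table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)')

nasty = "Robert'); DROP TABLE notes;--"

# The value is passed separately from the SQL text, so no escaping by hand
conn.execute('INSERT INTO notes (body) VALUES (?)', (nasty,))
rows = conn.execute('SELECT body FROM notes WHERE body = ?', (nasty,)).fetchall()

print(rows)
conn.close()
```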


Other notes

  • 7-bit safe: Some interesting escaping-like codecs for .encode()/.decode():
    • MIME Quoted Printable: 'quopri_codec'
      • "delimiter = @ \ndfgd".encode('quopri_codec') == 'delimiter=20=3D=20@=20\ndfgd'
    • Hex string (two-digit-per-byte): 'hex_codec'
      • '/var/www'.encode('hex_codec') == '2f7661722f777777'
    • Base64: 'base64_codec'
      • '/var/www'.encode('base64_codec') == 'L3Zhci93d3c=\n'
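
In Python 3 these bytes-to-bytes codecs are no longer reachable through str.encode(), but codecs.encode() and codecs.decode() still accept them. A small sketch:

```python
import codecs

data = b'/var/www'

hexed = codecs.encode(data, 'hex_codec')
b64 = codecs.encode(data, 'base64_codec')      # note the trailing newline
qp = codecs.encode(b'delimiter = @', 'quopri_codec')

# Each of them decodes back to the original bytes
roundtrip = (codecs.decode(hexed, 'hex_codec'),
             codecs.decode(b64, 'base64_codec'),
             codecs.decode(qp, 'quopri_codec'))

print(hexed, b64, qp)
```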


More on character codings:

  • Common character sets/encodings: utf8, ascii (point being that it raises an error when trying to encode ≥U+80), iso-8859-1 (a.k.a. latin_1), cp1252 (windows in western europe; is latin1 plus some characters in a range latin1 does not define)
    • You may be interested in chardet.


  • The "UnicodeEncodeError: 'ascii' codec can't encode character ..." error comes from an implicit conversion using the default encoding (see sys.getdefaultencoding()), which is 'ascii' by default and is a site-wide setting, so it is not the thing to change if you want portable code.
  • If you want console printing without errors, you have a few options, including:
    • s.encode('utf8') (will look like unicode on properly configured consoles, otherwise garbled)
    • s.encode('unicode-escape') (places \u2222-style escapes in string)
    • repr(s) - much like the latter
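
A sketch of what these options produce, written for Python 3. There, repr() keeps printable non-ASCII characters as they are, so ascii() is the closer match to the Python 2 repr() mentioned above.

```python
s = 'angle: \u2222'

as_utf8 = s.encode('utf8')               # bytes for a UTF-8 console
escaped = s.encode('unicode-escape')     # \u2222-style escapes, pure ASCII
as_repr = ascii(s)                       # like Python 2's repr() of a unicode string

print(as_utf8)
print(escaped)
print(as_repr)
```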


To avoid having codec conversions throw exceptions, you can add 'ignore' as a second parameter - though you should know that this means you will garble the data, so you should not do this just 'to make errors go away.' In particular, in multi-byte, variable-width encodings like UTF8 you may see many bytes being consumed even if they do not lead to a valid character. Trying to decode data that isn't UTF8 as UTF8 should be avoided.
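
To make the 'you will garble the data' point concrete, a small Python 3 sketch that decodes latin1 bytes as if they were UTF8. The sample bytes are made up.

```python
raw = b'caf\xe9 au lait'    # latin1 bytes; \xe9 on its own is not valid UTF-8

try:
    raw.decode('utf8')
    error_message = None
except UnicodeDecodeError as e:
    error_message = str(e)

ignored = raw.decode('utf8', 'ignore')     # the offending byte silently vanishes
replaced = raw.decode('utf8', 'replace')   # at least leaves a U+FFFD marker
correct = raw.decode('latin1')             # using the encoding it actually is

print(error_message)
print(repr(ignored), repr(replaced), repr(correct))
```

If errors must be suppressed, 'replace' at least leaves a visible trace of where data was lost.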

Unicode notes

Unicode and UTF

The "unichr() arg not in range ..." error reports either 0x10000 (narrow builds) or 0x110000 (wide builds).

The latter refers to the fact that the unicode character cap is (currently) at 0x10ffff - trying to create codepoints higher than that makes no sense. The former is more python-specific:


There are two flavours of unicode representation, chosen when building python. One is called narrow, apparently referring to UCS2, the other wide, apparently referring to UCS4.

I am not sure about this; I have seen mention of UTF-16 (and the use of surrogate pairs) and UTF-32, but narrow unicode builds do not support codepoints above 0xffff, as the unichr error illustrates.

Note that the windows (and mac?(verify)) build is narrow by default, while unices seem to default to be wide. You can build them differently.
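
You can check which flavour you have through sys.maxunicode: in Python 2 it is 0xFFFF on narrow builds and 0x10FFFF on wide builds. The sketch below is Python 3 (3.3 and later), where the narrow/wide distinction no longer exists, so it only shows the 0x110000 cap.

```python
import sys

# 0x10ffff on Python 3.3+; a narrow Python 2 build would report 0xffff
print(hex(sys.maxunicode))

try:
    chr(0x110000)            # one past the highest valid codepoint
    message = None
except ValueError as e:
    message = str(e)

print(message)
```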



Some common unicode-related errors

UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in position something

...means you have an invalid unicode codepoint (\u) in a u""-type (or ur""-type) string.

It seems that for ur strings, the choice was made to allow entering unicode codepoints, which disallows entering a literal '\u' (verify).


Another reason is trying to use a character beyond the 16-bit limit in a narrow unicode built python. (this one's somewhat more obvious from the specific error's elaboration)


UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position something: ordinal not in range(128)

Usually means that you have an already-encoded bytestring (likely to be UTF8 if the byte it trips over is in 0xd0 .. 0xd4), are assuming it's a unicode string, and encode() it again. Python 2 then first implicitly decodes the bytestring as ASCII, which is the step that fails.


This may mean that the piece of code gets both unicode strings and bytestrings, and doesn't deal with both equally well.


Side note: if it were a unicode string, it would happily encode it assuming they were codepoints. From that perspective, it's relevant that

'\xd0'.encode('u8')

means something different from:

u'\xd0'.encode('u8')
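
For contrast, a sketch of the same two cases in Python 3, where the types are kept apart: str has encode() only, bytes has decode() only, so the implicit ASCII decode cannot sneak in. The explicit decode below reproduces the error message from above.

```python
text = '\xd0'     # one codepoint, U+00D0
data = b'\xd0'    # one byte

encoded = text.encode('utf8')            # codepoint U+00D0 as UTF-8
has_encode = hasattr(data, 'encode')     # bytes cannot be encoded again

try:
    data.decode('ascii')                 # what Python 2 did implicitly
    error_message = None
except UnicodeDecodeError as e:
    error_message = str(e)

print(encoded, has_encode)
print(error_message)
```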

Pre-Py3K

Bytestrings are of type() str, unicode ones of type() unicode.


Syntax

Basic unicode strings with codepoints are written like u'Bla \u2222'. Low codepoints can additionally be entered and will be displayed as e.g. \x1a. Can be combined with raw/regexp form: ur.


When they contain just ASCII, you can convert back and forth between unicode and str:

unicode('C') == u'C'
str(u'C') == 'C'

When they contain high bytes/codepoints, it is ambiguous what the above would mean, and with the default 'ascii' encoding it raises a UnicodeDecodeError or UnicodeEncodeError. The usual fix is probably to encode() and decode() explicitly.


Things you should know about python unicode:

  • Prepending a u to a string literal, e.g. u'', means it is treated as a unicode string, and codepoints are referenced like \u2222.
  • In Python 2, bytestrings (str) and unicode strings mix fairly freely: a str is implicitly decoded (using the default encoding, normally ASCII) when combined with a unicode string, so you can do 'Normal' + u' \u2222 ', which will be u'Normal \u2222 '. This only works while the bytestring contains nothing but ASCII; otherwise you get a UnicodeDecodeError.
  • Prepending an r to a string literal, e.g. r'', makes it a raw string (often used for regexps), meaning backslash escapes are not interpreted. This means that when you want to store e.g. the two characters \1 in a string you don't have to write '\\1', but can write r'\1'. This makes regular expressions in particular a bit more readable.
  • For many types of output (including 'print'), unicode strings should be .encode('utf8')-d before being written.


Double encoding

If you somehow managed to get a bytestring as unicode codepoint-for-byte (literally, as in 0xC3 was converted to U+C3, which wouldn't normally happen) I've used quick hacks based on:

u'\xc3'.encode('latin1') == '\xc3'

This is probably only useful to detect and resolve double encoding.
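
A sketch of such a detect-and-resolve helper in Python 3. It relies on latin1 mapping U+00 through U+FF to the same byte values, which turns a codepoint-for-byte string back into bytes. The name undo_double_encoding is made up, and the check is a heuristic: a string that merely happens to look like valid UTF8 bytes would be changed too.

```python
def undo_double_encoding(s):
    """Hypothetical helper: if s looks like UTF-8 bytes stored
    codepoint-for-byte, return the properly decoded string."""
    try:
        # latin1 turns each codepoint below U+100 into the byte of the same value
        return s.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Codepoints above U+FF, or not valid UTF-8: leave it alone
        return s

print(undo_double_encoding('caf\xc3\xa9'))   # mis-decoded input
print(undo_double_encoding('caf\xe9'))       # already fine, returned unchanged
```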


Behaviour

The gist is that python strings automatically become a unicode string type when necessary. If you want to read something in some encoding you'll need to decode it from that into this python-internal format (UCS, it seems), while to print or store them you'll often have to convert them to some encoding.

For example:

  • to read from e.g. a web page, check what its encoding is and decode it, usually pagedata.decode('iso-8859-1') or .decode('utf8').
  • For pages and files you can also use streamreaders (see [1]).
  • to store in a database (or print to a utf console) you can often use .encode('utf8')
  • to print, python checks the shell encoding, which may well be ascii - in which case unicode conversions often fail; conversions prefer correct handling over the 'eh, it was a good guess' approach other systems often take. You can either set the encoding to something different - or make it ignore character data it cannot convert, using e.g. .encode('ascii','ignore')
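
The decode-on-input, encode-on-output pattern from the list above, as a small Python 3 sketch. The page bytes are made up, and an in-memory io.BytesIO stands in for a real file or socket when showing the streamreader.

```python
import codecs
import io

pagedata = b'caf\xe9'                       # bytes as received, here iso-8859-1

text = pagedata.decode('iso-8859-1')        # into python's own string type
stored = text.encode('utf8')                # e.g. for a database or utf8 console
lossy = text.encode('ascii', 'ignore')      # drops what ascii cannot represent

# Streamreader: wraps a byte stream and decodes as you read
reader = codecs.getreader('utf8')(io.BytesIO(stored))
from_stream = reader.read()

print(repr(text), stored, lossy, repr(from_stream))
```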


See also the available encodings. For related tutorials, see [2], [3].


HTML entities to unicode

See python snippets.


Py3K

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


See also