Python notes - syntax and language - changes and py2/3
On preparing for and doing the conversion to py3
It is usually easy (but sometimes awkward) to write code that is good for both ~py2.7 and py3k, which was nice preparation for the shift.
2to3 can do most of the syntax for you, but it cannot really think about how you mixed bytestrings and unicode strings - such is the drawback of dynamic typing when you change your entire language's string semantics.
Also, various third party libraries decided to move from bytes by default to unicode by default -- which is effectively an API change it cannot know about.
So you will have to review your code.
At one point I changed all my code to run under py3, to force myself to do all conversions over a weekend.
This was work, but the breaking changes were actually mostly boring, most changes being:
- sorting, where I wasn't using key= yet
- the equivalent key= form is typically obvious
- bytes/str around the edges of the app, e.g.
- file IO
- web framework
- database
- network library renames
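As a small illustration of the sorting change listed above: py3 dropped the cmp= argument (and the cmp() builtin), so the usual fix is a key function. This is only a sketch; the word list and the compare_lower helper are made up for the example.

```python
from functools import cmp_to_key

words = ['pear', 'Apple', 'fig']

# py2 style was: words.sort(cmp=lambda a, b: cmp(a.lower(), b.lower()))
# py3 has neither cmp= nor cmp(); a key function usually says the same thing.
by_key = sorted(words, key=str.lower)

# When a comparison function really is easier to write, cmp_to_key wraps it.
def compare_lower(a, b):  # hypothetical helper, for illustration only
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)  # common stand-in for py2's cmp()

by_cmp = sorted(words, key=cmp_to_key(compare_lower))
print(by_key, by_cmp)
```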
Differences in syntax and behaviour in 2.6, 3k
NOTE: These are notes to self, basically "things that bit me that I looked up"
Syntax differences
- py2.6: as became a reserved word
- py2.6: with became a reserved word
- py3: print() is a function
- so print("stuff") instead of print "stuff"
- and some tricks (like the comma to omit a newline) changed
- more details below
- py3: no more backticks, use repr() explicitly
- py3: int and long are now (syntactically) the same thing, so you can use int everywhere
- py3: Division
- In py2
- / coerces to float if it involves a float, e.g. 1/2.0==0.5 (it's not unusual to put float() around variable arguments to ensure you get float division)
- / stays an int if both arguments were int, e.g. 1/2==0
- // is integer (floor) division, basically. E.g. 1//2==0
- // still gives a float result if it involves a float, but the behaviour is no different, e.g. 1//2.0==0.0
- In py3
- / is always float division, e.g. 1/2==0.5
- // is as in py2
- py3: exceptions now always use as, e.g. except ErrorName as var:
- as was already possible since py2.6, but there the older except ErrorName, var: was also still allowed
- for more details: http://www.python.org/dev/peps/pep-3110/
- py3: no more has_key
- The in keyword should handle all cases.
- py3: no more g.next()
- It was renamed to g.__next__() because it's considered an internal method.
- your code should probably say next(g) rather than either of those.
- py3: no more <>, use != instead
- py3: you can no longer mix tabs and spaces. Probably a good thing.
- py3: allows type annotation: function annotations since py3.0, the typing module since py3.5, variable annotations since py3.6 (see below)
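A few of the syntax changes from the list above, as they look in py3. This is a minimal sketch with made-up variable names, not an exhaustive demonstration.

```python
# Division: / is always true division in py3, // floors.
true_div = 1 / 2         # a float
floor_div = 1 // 2       # an int
floor_float = 1 // 2.0   # floored, but still a float

# Membership test instead of d.has_key('a')
d = {'a': 1}
has_a = 'a' in d

# next(g) instead of g.next() (or g.__next__())
g = iter([10, 20])
first = next(g)

# except ... as ... is the only accepted form
try:
    int('x')
except ValueError as e:
    message = str(e)

# != is the only inequality operator; <> is a syntax error
different = (1 != 2)
print(true_div, floor_div, floor_float, has_a, first, message, different)
```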
Behaviour/implementation changes
- py3: list comprehensions no longer leak their iteration variables
- because they are actually functions with their own scope now
- py3: more things return an iterator or view, rather than a list (which is nice)
- dict.keys(), dict.items() and dict.values() (and dict.iterkeys(), dict.iteritems() and dict.itervalues() are gone)
- map() and filter() return iterators
- range() acts like the older xrange (and xrange is gone)
- zip returns an iterator
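- py2 had .next() for iterators, py3 calls the method __next__() and you use the next() builtin (which py2.6 and later also have)
- py3: comparisons between types are slightly stricter
- less likely to do strange things without warning (e.g. an ordering comparison like 1 < 'a' raises TypeError instead of giving an arbitrary answer), and some related details (no sys.maxint, no L in int's repr)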
- TODO: detail
- py3: bytestrings / unicode strings are their own topic, see below
- buffer, memoryview and such changed.
- largely compatible, but some details have changed
- if you use them, read up to be sure
- py3: has new-style classes only (that is, inherit from object) (no more classic classes)
- which most people were using already
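The behaviour changes above are easy to check in an interpreter. A small sketch, with throwaway names, of what "no leaking" and "returns a view or iterator" mean in practice:

```python
# Comprehension variables no longer leak into the enclosing scope.
x = 'outer'
squares = [x * x for x in range(3)]
x_after = x  # still 'outer' in py3; py2 would have left x == 2

# dict.keys() is a live view, not a list.
d = {'a': 1, 'b': 2}
keys = d.keys()
d['c'] = 3
keys_now = sorted(keys)  # the view sees the key added afterwards
try:
    keys[0]
    indexable = True
except TypeError:
    indexable = False    # views cannot be indexed; use list(keys) if you need that

# zip(), map() and filter() give one-shot iterators.
pairs = zip('ab', [1, 2])
first_pass = list(pairs)
second_pass = list(pairs)  # already exhausted

# range() is lazy like the old xrange, yet still supports len() and indexing.
r = range(10**9)
r_len, r_item = len(r), r[5]
print(x_after, keys_now, indexable, first_pass, second_pass, r_len, r_item)
```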
See also:
- 2to3 is a tool to parse code and show suggested changes (as a diff)
- http://docs.python.org/3.0/whatsnew/3.0.html
- http://wiki.python.org/moin/Python3000
- http://wiki.python.org/moin/FutureProofPython
- http://oakwinter.com/code/porting-setuptools-to-py3k/
Networking changes
Various things have been reorganized, but are largely unchanged:
- httplib has been moved to http.client
- urllib2 has mostly been moved into urllib.request and urllib.error
- as before, consider things like requests - not standard library, but convenient.
- urlparse became urllib.parse
- urlencode, quote, quote_plus, unquote moved from urllib to urllib.parse
- cgi.escape became html.escape
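Code that had to run on both versions often wrapped these renames in a try/except around the import. A sketch of that pattern, followed by the py3 names in use (the example URL and values are made up):

```python
try:
    # py3 locations
    from urllib.parse import urlparse, urlencode, quote, quote_plus, unquote
    from html import escape
except ImportError:
    # py2 locations (this branch is never taken under py3)
    from urlparse import urlparse
    from urllib import urlencode, quote, quote_plus, unquote
    from cgi import escape

parts = urlparse('http://example.com/path?q=1')
query = urlencode({'q': 'a b'})
quoted = quote('a b/c')            # leaves / alone by default
quoted_plus = quote_plus('a b/c')  # space becomes +, / is escaped too
unquoted = unquote('a%20b')
escaped = escape('<a href="x">')   # html.escape also escapes quotes by default
print(parts.netloc, query, quoted, quoted_plus, unquoted, escaped)
```

Note that html.escape() escapes quote characters by default, where cgi.escape() did not, so output can differ slightly after the rename.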
TODO: I think timeout details were cleaned up(verify)
See also Python_usage_notes/Networking_and_web#URL_fetching
Py2 versus Py3 string stuff
If you're new to python, maybe just learn the py3 way, and don't read this - there's no point in confusing yourself with both parts of a conflicting change.
Types and literals
- The distinctions in py2 and py3 are the same
- there's a bytestring
- there's a unicode string
- What has changed is their names, and which is the default
- str in py2 meant bytestring, str in py3 means unicode
- put another way, in py2:
- str is bytestring type, a literal looks like ''
- unicode is unicode string type, a literal looks like u''
- and in py3:
- bytes is bytestring type, a literal looks like b''
- str is unicode string type, a literal looks like ''
Changes in code - conversions
- unicode from integer codepoints
- py2: unichr(i)
- because chr() returned a bytestring so could only deal with 0..255
- py3: chr(i)
- bytestring from integer(s)
- py2: chr(i)
- py3: bytes([i])
- If you want "bytes from integers" with syntax that works in both py2 and py3
- you will find that bytes([i]) is valid code in py2 but does something else (bytes is str there, so you get the repr of the list, e.g. '[65]')
- there's a duct tape fix in bytes(bytearray([i])), though note it won't be the most efficient (bytearray, available in ≥py2.6, is a mutable counterpart to bytes)
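The py3 side of the conversions above, as a quick sketch (variable names are illustrative only):

```python
# Unicode from a codepoint: chr() covers all of unicode in py3.
cherokee = chr(0x13D6)

# Bytes from integers.
one_byte = bytes([65])
several = bytes([72, 105])

# The form that gives the same result in py2 and py3.
portable = bytes(bytearray([65]))

# py3 gotcha: an int (rather than a list of ints) means "this many zero bytes".
zeros = bytes(3)

# Going the other way, indexing a bytes object gives an int in py3.
as_int = b'A'[0]
print(cherokee, one_byte, several, portable, zeros, as_int)
```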
text conversions (bytestring-unicode)
- bytes to unicode:
- py2: ''.decode(encoding)
- py3: b''.decode(encoding)
- unicode to bytes:
- py2: u''.encode(encoding)
- py3: ''.encode(encoding)
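A round trip in py3, plus what happens when the codec does not match the data. A minimal sketch; the sample character is arbitrary.

```python
text = '\u13d6'                 # a single non-ASCII character
raw = text.encode('utf-8')      # unicode to bytes
back = raw.decode('utf-8')      # bytes to unicode

# Decoding with the wrong codec fails loudly...
try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# ...unless you pick an error handler, which trades the exception for data loss.
lossy = raw.decode('ascii', errors='replace')
print(raw, back == text, failed, lossy)
```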
Note that
- conversions during initialization are also a thing, e.g.
- initialize a bytes object from unicode and an encoding (seems equivalent to str.encode(verify)), e.g. bytes('something', 'utf-8') == b'something'
- initialize a unicode object from bytes and an encoding (seems equivalent to bytes.decode(verify)), e.g. str(b'something', 'utf-8') == 'something'
- py2 let you .encode() a bytestring (seemed intended for codepage stuff?), which was a somewhat confusing exception, and py3 does not allow you to .encode a bytes object.
- py2 also allowed some byte-to-byte codings, that are now only available via the codecs module:
byte conversions
The codecs module's codecs.encode() and codecs.decode(), aside from text encodings, also handle bytes-to-bytes codecs (found only there), including:
- 'hex' / 'hex_codec'
- 'base64' (Base64)
- 'bz2'
- 'zlib'
- 'uu' / 'uu_codec' (UUEncoding)
- 'quopri' / 'quopri_codec' (quoted printable)
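A sketch of a few of these bytes-to-bytes codecs in py3. The dedicated modules (binascii, base64, zlib) do the same work and are often the more direct route; this only shows that the codec names listed above still function through codecs.

```python
import codecs

hexed = codecs.encode(b'hello', 'hex')
unhexed = codecs.decode(hexed, 'hex')

b64 = codecs.encode(b'hello', 'base64')   # note the trailing newline in the result

packed = codecs.encode(b'a' * 100, 'zlib')
unpacked = codecs.decode(packed, 'zlib')
print(hexed, unhexed, b64, len(packed), len(unpacked))
```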
Iterating
py2:
list('1234') == ['1', '2', '3', '4']
list(u'1234') == [u'1', u'2', u'3', u'4']
py3:
list(b'1234') == [49, 50, 51, 52]
list('1234') == ['1', '2', '3', '4']
Coercion
py2:
- allowed coercing bytestrings to unicode (decoding them as ascii), e.g. u'\u13d6' + b'foo' == u'\u13d6foo'
py3:
- the same raises a TypeError
- forces you to be explicit, e.g. doing an encode or decode to work on the same type
While that leads to a little more typing in py3, it's also less fuzzy on responsibilities and less sensitive to typing errors
It also makes it harder to write functions that work on both bytestrings and unicode strings, but that's arguably sort of the point.
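What that explicitness looks like in py3, as a sketch. The to_text helper is hypothetical, one common way to normalise input at the edge of a program when a function has to accept both types.

```python
# Mixing the two types is an error rather than an implicit decode.
try:
    '\u13d6' + b'foo'
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Pick a type and convert explicitly.
joined_text = '\u13d6' + b'foo'.decode('ascii')
joined_bytes = '\u13d6'.encode('utf-8') + b'foo'

# Comparison does not raise, it is just never equal, which is easy to miss.
same = ('a' == b'a')

def to_text(value, encoding='utf-8'):  # hypothetical helper, not a stdlib function
    """Return str, decoding bytes input with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(mixed_ok, joined_text, joined_bytes, same, to_text(b'abc'), to_text('abc'))
```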
Brackets
- python2 allowed print without brackets
- ...which was an almost-singular exception in terms of syntax
- python3 makes it print(), like all other functions. It's a syntax error not to
- it also has some keyword arguments:
- end: string appended after the last value, default a newline (allows imitating the py2-comma-at-the-end)
- sep: string inserted between values (default a space)
- file: output to a file-like object (defaults to sys.stdout)
- flush: whether to forcibly flush the stream (default false)
print without newline
py2:
print 'foo',
py3:
print('foo', end='')
If you want something that works in both py2 and py3, use sys.stdout.write()
Sequences and non-string objects
in both py2 and py3, non-string objects will basically be asked for their .__str__()
Note that
- py2 with brackets would basically interpret this as a single tuple argument, so it prints (1, 2, 3, 4)
print(1,2,3,4)
- py3 interprets the same as multiple (non-keyword) arguments, so it prints 1 2 3 4 (with a single argument, the two amount to the same output)
print(1,2,3,4)
in py2, the separator was a single space, in py3, it can be specified:
print(1,2,3,4, sep='\t')
in py2, output went to sys.stdout unless you used the print >>f, 'text' form, in py3 you can specify any file-like object
print("WARNING", file=sys.stderr)
- e.g. for stderr this means you don't have to use its write() instead of print
in py2, explicit flushing meant calling sys.stdout.flush(), in py3 you can do:
print(' ', flush=True)
You can get py3 style print() in py2, since py2.6, via from __future__ import print_function
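The keyword arguments are easiest to see by printing into an in-memory text stream instead of the console. A small sketch using io.StringIO as the file= target:

```python
import io

out = io.StringIO()
print(1, 2, 3, sep='\t', end='', file=out)  # custom separator, no newline
print('!', file=out)                        # default end is a newline
print((1, 2, 3), file=out)                  # one tuple argument, printed via str()
captured = out.getvalue()
print(repr(captured))
```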
Unsorted:
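- ≥py2.6 can do from __future__ import unicode_literals
- which makes unprefixed string literals in that module unicode rather than bytestrings (as they are in py3)
- python2 added bytes as an alias for str in 2.6
- so bytes is str there; it gives py2 code a py3-compatible name for the bytestring type without changing any behaviour
- some easing of transition:
- ≥py2.6 accepts b'' and bytes, aliased to its str type (bytestring)
- ≥py3.3 accepts u'' as an alias for its str (unicode)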
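Keep in mind that print is geared towards printing (unicode) text, and sys.stdout is a text stream (an io.TextIOWrapper, so an io.TextIOBase) that encodes whatever you give it.
If you want to output raw bytes in py3, you'll basically need to write to the underlying binary stream, i.e. sys.stdout.buffer.write(), because sys.stdout.write() itself only accepts str.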
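The text layer and the byte layer can be shown without touching the real console. In this sketch an io.BytesIO stands in for the terminal, wrapped the same way sys.stdout wraps its binary buffer; the names are illustrative.

```python
import io

raw = io.BytesIO()  # stand-in for the console's byte stream
text_layer = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')

print('\u13d6', file=text_layer)       # text is encoded on its way through
text_layer.flush()                     # push it down before writing underneath
text_layer.buffer.write(b'\x00\xff')   # raw bytes bypass the encoding step

try:
    text_layer.write(b'raw')           # the text layer itself refuses bytes
    accepted = True
except TypeError:
    accepted = False

result = raw.getvalue()
print(result, accepted)
# With the real console the equivalent is sys.stdout.buffer.write(b'...')
```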
Console and locale details
Remember that print is IO that is specifically aimed at the console, so its behaviour depends on the environment, and it comes with more wrapping code than most other IO.
It will, for example, convert to the locale -- which could fail if that's a codepage or 'ascii'
sys.stdout.encoding is what is set for that specific stream
- on windows consoles this is UTF-8 since py3.6 (previously it was codepages, behaviour you can still get with PYTHONLEGACYWINDOWSSTDIO); see https://www.python.org/dev/peps/pep-0528/
- on unices this is based on your locale ([1][2]), which makes sense for console
- on a lot of locales it amounts to UTF-8 so will work well and combine well -- but more because of convention, and you may not want to count on this
- if it ends up being 'ascii', it's probably the main reason you've seen "'ascii' codec can't encode characters in position ...: ordinal not in range"
- ...so there are use cases where you want to force it to something (probably UTF-8), like piping programs together.
- [3] overrides stdin/stdout/stderr
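- you can't set it by assigning to sys.stdout.encoding - that throws an exception (though since py3.7 sys.stdout.reconfigure(encoding='utf-8') changes it on the existing stream) http://www.macfreek.nl/memory/Encoding_of_Python_stdout#Overriding_the_Encoding_of_stdout_or_stderr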
- locale.getpreferredencoding() (for unix-style locales)
- ...as an indication of what the console uses, but it's basically used for that already
sys.getdefaultencoding()
- used for implicit encodings (so more in py2), as a system-wide default and nothing better implied by context
- but was 'ascii' on py2 unless overridden -- and there are good (breaky) reasons you never want to do that [4]
- is always 'utf-8' in py3
sys.getfilesystemencoding()
- intended for filenames, command-line arguments, environment variables
- often utf-8?
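To see what these report on a given machine, a short sketch like the following prints all four side by side. Only the default encoding is fixed in py3; the rest depend on platform and environment, so no particular output should be assumed.

```python
import locale
import sys

report = {
    # what this specific stream encodes to (may be missing if stdout was replaced)
    'stdout': getattr(sys.stdout, 'encoding', None),
    # what the locale suggests; open() in text mode has used this as its default
    'preferred': locale.getpreferredencoding(False),
    # used for implicit conversions
    'default': sys.getdefaultencoding(),
    # used for filenames, command-line arguments, environment variables
    'filesystem': sys.getfilesystemencoding(),
}
for name, value in report.items():
    print('%-12s %s' % (name, value))
```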
https://pythonhosted.org/kitchen/unicode-frustrations.html
Paths
pathlib
You might like pathlib (≥py3.4), which...
- potentially takes a little less thought around portability
- hides some (but not all) of the ugly and annoying cases (though still requiring you to know that and roughly why these edge cases exist)
- it does case insensitive matching where it applies (windows)
- is a little more expressive, and some operations take less typing.
A Path is an object
- there's some abstraction layer with a tree
- mostly because PosixPath and WindowsPath are a little different in details -- but you probably don't want to instantiate those (if you want filesystem code that is portable)
- if you only care about path logic and not an actual filesystem, you can use PurePath / PureWindowsPath and PurePosixPath
- str() makes it a string path (in native form), which you should be able to pass to anything that expects a string path
- (most of the standard library understands pathlib objects)
- (relatedly, note that some functions may act differently with paths as str versus path as bytes)
- in windows, you may like raw strings so that backslashes don't need to be escaped(verify)
pathlib.Path(r'C:\file.txt') # rather than pathlib.Path('C:\\file.txt')
For example
pathlib.Path.home().is_dir() == True
list( pathlib.Path.home().glob('.*') )
# path joins are a little less typing
pathlib.Path.home().joinpath('test.txt')
pathlib.Path.home() / 'test.txt' # (overloaded operator)
p = pathlib.Path.home().joinpath('test.txt')
# varied filename parsing (basename, stem, suffix(es)) is now attributes instead of specific calls
[str(p.parent), p.stem, p.suffix] == ['/home/me', 'test', '.txt']
# some pathname manipulation is also less typing
pathlib.PurePath('test.txt').with_suffix('.log') == pathlib.PurePath('test.log')
pathlib.PurePath('test.txt').with_name('foo') == pathlib.PurePath('foo')
....and centralizes a bunch of things, previously more spread around os or os.path module
drawing in a bunch of filesystem interaction
- .iterdir()
- .glob() on paths that represent directories
p = pathlib.Path.home().glob('*.dat')
- .rglob() is a recursive variant
list( pathlib.Path.home().rglob('*.txt') ) # "all text files under my homedir"
- .stat(), .lstat()
- .readlink()
- .exists()
- .rename()
- .replace()
- .rmdir()
- .unlink()
- .symlink_to()
- .hardlink_to()
- is_absolute, is_relative_to
- is_block_device, is_char_device, is_dir, is_fifo, is_file, is_mount, is_reserved, is_socket, is_symlink
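A sketch that exercises a handful of the methods listed above inside a temporary directory, so that it leaves nothing behind. The file names are made up; mkdir() and write_text() are further Path methods that the list does not mention.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    sub = root / 'sub'
    sub.mkdir()
    (root / 'a.txt').write_text('hello', encoding='utf-8')
    (sub / 'b.txt').write_text('world', encoding='utf-8')

    direct = sorted(p.name for p in root.glob('*.txt'))      # this directory only
    recursive = sorted(p.name for p in root.rglob('*.txt'))  # subdirectories too
    entries = sorted(p.name for p in root.iterdir())

    (root / 'a.txt').rename(root / 'a.log')
    renamed_exists = (root / 'a.log').exists()
    old_exists = (root / 'a.txt').exists()

    (root / 'a.log').unlink()
    after_unlink = (root / 'a.log').exists()

    sub_is_dir = sub.is_dir()
    size = (sub / 'b.txt').stat().st_size

print(direct, recursive, entries, renamed_exists, old_exists, after_unlink, sub_is_dir, size)
```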
IO
network IO and file IO is bytes by nature so often involves either a lot of bare interfacing with explicit conversions, OR "I'll do everything for you" libraries.
file open()
py2 had file()[5] and open()[6], which achieved mostly the same thing, but with a somewhat different interface.
py3 has only open()[7] (which is the same as io.open)
open() also controls what type read() will give you.
py2:
- open() gives bytes
py3:
- open() in text mode (without 'b' in the mode): gives unicode
- has an encoding parameter, if not specified will default to locale.getpreferredencoding(False)
- on *nix and Android may well be 'UTF-8', on windows is probably 'cp1252'. Don't assume any of those.
- open() in binary mode (with 'b' in the mode): gives bytes. Specifying an encoding will raise ValueError.
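The two modes side by side, as a sketch using a temporary file. Passing encoding= explicitly in text mode avoids depending on the platform default mentioned above.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
binary_error = None
try:
    with open(path, 'w', encoding='utf-8') as f:   # text mode: write str
        f.write('\u13d6')
    with open(path, 'r', encoding='utf-8') as f:   # text mode: read() gives str
        as_text = f.read()
    with open(path, 'rb') as f:                    # binary mode: read() gives bytes
        as_bytes = f.read()
    try:
        open(path, 'rb', encoding='utf-8')         # the combination is refused
    except ValueError as e:
        binary_error = str(e)
finally:
    os.remove(path)

print(repr(as_text), as_bytes, binary_error)
```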
subprocess:
py2:
- pipes are bytes
subprocess.Popen('ls', stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
('\xe1\x8f\x96\n', '')
py3:
- pipes are bytes unless you explicitly specify an encoding (the encoding argument exists since py3.6)
subprocess.Popen('ls', stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
(b'\xe1\x8f\x96\n', b'')
subprocess.Popen('ls', stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding='utf8').communicate()
('Ꮦ\n', '')
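A portable variant of the same comparison, as a sketch: it runs the current Python interpreter as the child process instead of ls, and uses subprocess.run (py3.5 and later) rather than Popen.

```python
import subprocess
import sys

cmd = [sys.executable, '-c', 'print("hello")']

# No encoding given: the pipes carry bytes.
as_bytes = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# With an encoding: the pipes are wrapped in a text layer and give str.
as_text = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         encoding='utf-8')

print(type(as_bytes.stdout).__name__, type(as_text.stdout).__name__)
```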
Bytes and strings around file IO
- py3: no more file()
- and the built-in open() (the synonym file() no longer exists) grew some extra behaviour:
- to read/write strings, automatically converted from their encoded form, specify an encoding. The default encoding is platform-dependent, so you probably don't want to rely on that.
- to read/write bytes, specify a binary mode, ('rb', 'wb') and don't specify an encoding
On in-memory versions:
- StringIO acts like a file opened in text mode,
- BytesIO acts like a file opened in binary mode.
misc
b'aa'.decode('hex_codec')
will tell you
LookupError: 'hex_codec' is not a text encoding; use codecs.decode() to handle arbitrary codecs
Instead, probably do (works on unicode and bytes input):
import codecs
codecs.decode(b'aa', 'hex_codec')
Type annotation
See Python_notes_-_syntax_and_language#Type_annotation
StringIO, and now BytesIO
For context / basic use
Python2 had StringIO objects: objects that acted like file objects you could write() bytestrings into. This is useful where a function wants to write to a file object but you want to avoid the filesystem, e.g. for speed (by avoiding IO) or to avoid potential permission problems.
Example:
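import StringIO                # py2-only module
sio = StringIO.StringIO()
im.save(sio, 'PNG')            # im is an existing PIL Image; with no filename to guess from, name the format
data = sio.getvalue()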
Notes:
- because this is file-like, you cannot mix write() and read() without considering the seek position, so people may use getvalue() over seek(0) and read()
- Once the StringIO object is close()d, the contents are gone
- not usually a problem, as
- most save functions either take a filename and do an open()-write()-close() (in which case stringio is fairly irrelevant), or take a file object and just write() (in which case you're fine)
- but if they write and close, you may need some monkey patching to get it to do what you want
- (could apparently also take unicode, but most people use it for bytes?)
StringIO was a python implementation in its standard library, cStringIO was a faster C implementation and drop-in, so you'd often see
try: # use the faster extension when we can
    import cStringIO as StringIO
except ImportError: # drop back to python's own when we must
    import StringIO
Python3 changes
Since bytestrings are now called bytes and not str, the same functionality is now in io.BytesIO
- which behaves the same way
- though BytesIO also lets you have a read-write view[8]
- which py2 did not[9]
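Python3 also has similar things for unicode strings.
- io.StringIO is the in-memory counterpart for text: a file-like object that you write() and read() str with
- io.TextIOWrapper wraps a binary stream (a BytesIO, a file opened in binary mode) and does the encoding and decoding, so it is the layer where the conversions happen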
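The three io classes side by side, as a small sketch. It also shows the getvalue() versus seek()/read() point and the contents-gone-after-close() point from the notes further up.

```python
import io

# BytesIO: in-memory file opened in binary mode.
bio = io.BytesIO()
bio.write(b'\xe1\x8f\x96')
data = bio.getvalue()   # everything written, regardless of seek position
at_end = bio.read()     # position is at the end after writing, so nothing comes back
bio.seek(0)
reread = bio.read()

# StringIO: in-memory file opened in text mode.
sio = io.StringIO()
sio.write('\u13d6')
text = sio.getvalue()

# Each refuses the other's type.
try:
    io.BytesIO().write('text')
    bytes_took_str = True
except TypeError:
    bytes_took_str = False

# TextIOWrapper: decodes (or encodes) on top of a binary stream.
wrapped = io.TextIOWrapper(io.BytesIO(b'\xe1\x8f\x96\n'), encoding='utf-8')
decoded = wrapped.read()

# After close() the contents are no longer reachable.
bio.close()
try:
    bio.getvalue()
    readable_after_close = True
except ValueError:
    readable_after_close = False

print(data, at_end, reread, repr(text), bytes_took_str, repr(decoded), readable_after_close)
```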
https://docs.python.org/3/library/io.html