Python notes - semi-sorted

Syntaxish: syntax and language · importing, modules, packages · iterable stuff · concurrency

IO: networking and web · filesystem

Data: Numpy, scipy · pandas · struct, buffer, array, bytes, memoryview · Python database notes

Image, Visualization: PIL · Matplotlib, pylab · seaborn · bokeh · plotly


Processes: threading · subprocess · multiprocessing · joblib · pty and pexpect

Stringy: strings, unicode, encodings · regexp · command line argument parsing · XML

date and time

semi-sorted



IPython

IPython is a collection of:

  • an interactive shell, more featured than python's own.
http://ipython.org/ipython-doc/rel-0.12/interactive/tutorial.html


  • integrates with some interactive data visualization
  • integrates with GUI toolkit
...both of which are used in...
  • notebooks - served via browser, allows embedded code, text, plots, mathematical expressions


  • some tools for parallel computing (due to itself being abstracted out this way)
  • makes it easier to embed an interpreter into your own project


  • I like the way it eases hooking in profiling into ipython, via its magic functions:
 %time - how much time (one run)
 %timeit - how much time (in a bunch of runs, at least a second's worth?(verify))
 %prun - how much time, per function
 %lprun - how much time, per line
 %mprun, %memit - how much memory per function (once, a bunch)
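Those magics only exist inside IPython. Outside it, the stdlib timeit module gives a similar "time this over a bunch of runs" measurement; a minimal sketch:

```python
import timeit

# time a snippet over many runs, like %timeit (here the run count is explicit)
seconds = timeit.timeit('sorted(data)',
                        setup='data = list(range(1000))[::-1]',
                        number=1000)
print('1000 sorts took %.4f s' % seconds)
```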

See also:


notebooks and jupyter

Python notebooks mean you can play through a web interface.

Notebooks are a webpage frontend (to an interactive backend), which makes it

easier to play with code visually than the shell
easier to use remotely
easier to persist the notebook
easier to persist the interpreter behind it (to a degree)


This means less typing and more prettiness while you're doing plotting, math, or anything else you can manage via the python shell.

You can copy the notebooks elsewhere, bootstrapping other people to do similar experiments to yours.

For some examples, see e.g. https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks


This used to be part of ipython (and called ipython notebooks), but this has since widened, and became more of a protocol (it already was fairly agnostic, being largely JSON over 0MQ) to whatever sort of backend you want.

That framework is called jupyter, and ipython is just one of its possible kernels - see this list for more.

You can build your own, for existing languages, or basically build your own DSL to play with.

If you like the technical details, see e.g.

https://ipython.org/ipython-doc/3/development/how_ipython_works.html
https://ipython.org/ipython-doc/3/development/messaging.html

The below mostly focuses on python.



Basic use

The actual notebook file gets stored in the directory you run this from (so you may wish to organize notebooks into directories a bit):

jupyter notebook


By default it binds to 127.0.0.1:8888 and locally launches a browser.

When working remotely
consider using a SSH tunnel like
ssh -L localhost:8888:localhost:8888 workhost
(and pointing your browser at 127.0.0.1:8888 on the SSH-client side)
On a trusted LAN you can consider doing
--ip=0.0.0.0
(and maybe
--port=80
) so that it's easily reachable.


You then probably want to look at Help → keyboard shortcuts.

Most important to start with is probably Shift-Enter: run cell, go to next cell



JupyterLab

Basically a more extended and extensible frontend, with more integration and more convenience.


JupyterHub

Basically, it's a login service (e.g. PAM, OAuth), and keeps track of notebooks per user (which are still single-user).


Basic python3 conversion notes

I still coded for python2 while system pythons were still ~py2.7, but even then it was a good idea to learn to code for both.


Two things you want to know about:

  • this is one useful summary of differences
  • 2to3 is a tool to parse code and show suggested change as a diff


Notes to self, roughly in order of 'has bitten me most'

  • print() now requires brackets like all other functions
which sometimes changes semantics a little when passing in sequences(verify)
if you use the comma trick to avoid a newline, you can't do that in py3 anymore. Adding
end=''
is not py2 compatible. You'll need to do some rewriting of your prints, or use sys.stdout.write() instead
if you can assume ≥py2.6(verify) you could use
from __future__ import print_function
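A minimal compatibility sketch (runs as-is on py3, and on ≥py2.6 thanks to the future import):

```python
from __future__ import print_function  # a no-op on py3; enables print() on >=py2.6

import sys

# py3-style keyword arguments replace the old trailing-comma trick:
print('no newline after this', end='')
print(' -- and this line goes to stderr', file=sys.stderr)

# sys.stdout.write() behaves the same on both versions:
sys.stdout.write('portable either way\n')
```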


  • exception syntax: use 'as' instead of a comma
except Exception as e
except (ValueError, TypeError) as e

instead of py2's:

except Exception, e
except (ValueError, TypeError), e
Using
as
has been supported since py2.6 or py2.7 so just use it always now.


  • more things return an iterator or view, rather than a list (which is nice)
e.g. map() and filter() return iterators, as does zip(), and range() now effectively acts like xrange() used to
This changes nothing if you just use them as sequences, but in some places you now need an explicit list() or tuple()
py2 had a .next() method on iterators; py3 has a next() builtin
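A short illustration of both points in py3:

```python
# map()/filter()/zip() return lazy iterators in py3; wrap in list() when you need a sequence
squares = map(lambda x: x * x, [1, 2, 3])
assert not isinstance(squares, list)      # an iterator, not a list
assert list(squares) == [1, 4, 9]
assert list(squares) == []                # ...and it is now exhausted

# the .next() method became the next() builtin
it = iter([10, 20])
assert next(it) == 10
assert next(it, 'fallback') == 20
assert next(it, 'fallback') == 'fallback'  # next() can take a default instead of raising
```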


  • strings and unicode:
in py2, a bare literal ('') is a bytestring, and the u prefix (u'') means unicode string
in py3, the b prefix (b'') means bytestring, and a bare literal ('') is a unicode string
≥py3.3 also accepts the u prefix again (which does the exact same thing as a bare literal, but makes porting from py2 easier)
When you use strings to print text, things will mostly work as-is.
When you are explicitly handling conversions between these things, you'll need to rewrite that
When you need to support both py2 and py3:
you can get py3 behaviour in ≥py2.6 via
from __future__ import unicode_literals
(verify)
in other cases you may wish to cheat, e.g. instantiating all strings via a function that does some conversion based on the python version
TODO: figure out how to best deal with libraries
iterating over a bytestring returns integers in py3 (it was one-length strings in py2)
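The explicit conversions in py3 are encode() and decode():

```python
# explicit conversion between text (str) and bytes in py3:
text = u'h\xe9llo'                    # a unicode string ('héllo')
data = text.encode('utf-8')           # str -> bytes
assert isinstance(data, bytes)
assert data.decode('utf-8') == text   # bytes -> str

# iterating a bytestring yields integers in py3 (py2 gave one-length strings)
assert list(b'AB') == [65, 66]
```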
  • relatedly, here are some extra details around paths and shell and subprocess stuff
command line arguments are unicode (verify)
subprocess pipes returns bytestrings (verify)
paths are always unicode strings now
...meaning there are a few more combinations where you need to think about conversions.
  • relatedly, open() is now either text mode (returning str) or binary mode (returning bytes), depending on whether you specified 'b'
defaults to the platform encoding (which is probably utf8 but you may want to not count on that)
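A minimal sketch of that difference (the temp file here is just for illustration); passing an explicit encoding= avoids depending on the platform default:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# text mode: give an explicit encoding rather than relying on the platform default
with open(path, 'w', encoding='utf-8') as f:
    f.write(u'text \xfcnicode')

with open(path, 'r', encoding='utf-8') as f:
    assert isinstance(f.read(), str)     # text mode returns str

with open(path, 'rb') as f:
    assert isinstance(f.read(), bytes)   # binary mode returns bytes
```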


  • buffer, memoryview and such changed.
largely but not fully compatible. If you use them, you'll need to read up.


More on the details side:

  • has_key is gone. The in keyword should handle all cases.
  • py3 has new-style classes only (that is, all classes inherit from object)
which most people were using already
  • py2 had both file() and open(). py3 has only open()
  • some more type-related stuff
int and long are now the same thing, so you can use int everywhere
Division:
In py2:
/ coerces to float if either operand is a float, otherwise does int division, e.g. 1/2==0, 1/2.0==0.5
// does a floor; the result is a float if either operand is a float, e.g. 1//2==0, 1//2.0==0.0
In py3:
/ is always float division, e.g. 1/2==0.5
// behaves as in py2
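The py3 behaviour as a few checkable lines:

```python
# py3 division semantics
assert 1 / 2 == 0.5          # / is always true (float) division
assert 1 // 2 == 0           # // floors; int operands give an int
assert 1 // 2.0 == 0.0       # ...and a float operand gives a float
assert -1 // 2 == -1         # it's a floor, not a truncation
assert divmod(7, 2) == (3, 1)  # quotient and remainder in one go
```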





See also:


Moving from py2 to py3 means reviewing all your code. For the most part this is trivial, and largely automatic (see 2to3).



Interesting python3 features

del

del x
removes the binding of x, in the namespace it is done from (unless it was declared global in that namespace).

If that was the only reference (in any scope), then it may well mean it is going to be garbage collected very soon (gc details vary per python implementation). (Note also that for local variables, del x is practically similar to x=None)


It turns out you can also del:

an element from a list by index (removes that element)
slices (same idea)
a key from a dict (removes the reference to both key and value)
attribute references

presumably this is special-cased in these classes?(verify)


(though pop() is more practical in some dict and list cases)
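Each of those targets in a few lines:

```python
# del on various targets
l = [10, 20, 30, 40]
del l[0]            # removes the element at that index
assert l == [20, 30, 40]
del l[1:]           # slices work too
assert l == [20]

d = {'a': 1, 'b': 2}
del d['a']          # removes the key (and the reference to its value)
assert d == {'b': 2}

class Box(object):
    pass

b = Box()
b.x = 5
del b.x             # attribute references
assert not hasattr(b, 'x')
```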

https://docs.python.org/3/reference/simple_stmts.html#the-del-statement

Setting the process name

It's OS-specific stuff, so there is no short portable way.

The easiest method is to install/use the setproctitle module - it aims to be portable and try its best on various platforms. Example use:

import os
import sys

try:
    import setproctitle
    setproctitle.setproctitle( os.path.basename(sys.argv[0]) )
except ImportError:
    pass

The above often means 'use the filesystem name that was used to run this' - but not always, so a hardcoded string can make sense.

Useful links

If you are new to Python, common suggestions for learning it include Dive Into Python, The python flavour of How to Think Like a Computer Scientist, or Thinking In Python.





Libraries and documentation



Semi-sorted:


Additional notes you may wish to have seen some time

array, deque

A list can be used as a sort of deque, like:

append(val)     # insert on right side 
pop()           # take from right side
insert(0,val)   # insert on left side
pop(0)          # take from left side

However, list is primarily efficient for stack-like use - list.pop(0) and list.insert(0, val) are O(n) operations,

...while those are O(1) operations on collections.deque (added in 2.4). deque also has appendleft(), extendleft(), and popleft(), and some others that can be convenient and/or more readable.
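A quick sketch of the deque operations mentioned above:

```python
from collections import deque

d = deque([2, 3])
d.appendleft(1)              # O(1), vs list.insert(0, val) which is O(n)
d.append(4)
assert list(d) == [1, 2, 3, 4]
assert d.popleft() == 1      # O(1), vs list.pop(0) which is O(n)
assert d.pop() == 4

# an optional maxlen turns it into a bounded buffer that discards from the far end
recent = deque(maxlen=3)
for i in range(5):
    recent.append(i)
assert list(recent) == [2, 3, 4]
```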

You may also wish to know about the queue module (Queue in py2), a multi-producer, multi-consumer, thread-safe queue.

See also:

StringIO

You can write to StringIO objects, and ask them for the data they caught so far. They store this data only in memory.

It is mostly useful where a function wants to write to a file object, but you want to avoid the filesystem (for convenience, to avoid potential permission problems, or for speed by avoiding IO).

Two caveats:

  • reading back is easiest via getvalue(), which returns the full contents so far; read() only returns data after the current position, so you would need a seek(0) first
  • Once the StringIO object is close()d, the contents are gone
not usually a problem, as most save functions either take a filename and do an open()-write()-close() (in which case StringIO is fairly irrelevant), or take a file object and just write() (in which case you're fine)
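A minimal sketch, using the py3 location of this class (io.StringIO; io.BytesIO is the bytes equivalent):

```python
import io  # py3: StringIO lives in the io module

buf = io.StringIO()
buf.write('line 1\n')
buf.write('line 2\n')

assert buf.getvalue() == 'line 1\nline 2\n'  # everything written so far

# read() only returns data after the current position, so rewind first:
buf.seek(0)
assert buf.read() == 'line 1\nline 2\n'
```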


cStringIO is the faster, written-in-C drop-in (py2 only; py3 merged this into io.StringIO and io.BytesIO). In py2 it is often useful to do:

try: # use the faster C extension when we can
    import cStringIO as StringIO
except ImportError: # fall back to the pure-python module when we must
    import StringIO

See also:

Date stuff

Speed, memory, debugging

Profiling

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

One easy way is to use cProfile.py, which I like in a wrapper script like:

#!/bin/bash
python -m cProfile -s time $@

...or -o to save it to a file, so that you can inspect it more flexibly.
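The same thing can be done programmatically with cProfile plus pstats; a sketch (the work() function is just a stand-in for whatever you want profiled):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the code you actually want to profile
    return sum(i * i for i in range(10000))

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

# format the collected stats into a string, top 5 functions by internal time
out = io.StringIO()
pstats.Stats(prof, stream=out).strip_dirs().sort_stats('time').print_stats(5)
report = out.getvalue()
print(report)
```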


For a little more control, you could add e.g. hotshot (py2 only; it was removed in py3) to your __main__ code. I've taken up the habit of putting it in a function like:

def profile():
    import hotshot
    prof = hotshot.Profile("hotprof") #file to write the profile to
    prof.runcall( main ) # or a wrapper function containing the things you want profiled
    prof.close()


...and use something like RunSnakeRun, a graphical profile viewer. In text mode, you'd probably want a simple script to view the results instead, for example:

import hotshot.stats
stats = hotshot.stats.load("hotprof")
stats.strip_dirs()
stats.sort_stats('time', 'calls')
stats.print_stats(20)

Memory-related notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Memory profilers (/leak helpers)

Include:

  • dowser [1] (gives graphical statistics - uses CherryPy, PIL)
  • heapy [2]
  • pysizer [3] (can use pyrex if available)
  • Python Memory Validator [4] (commercial)

Since Py3.4 there is also tracemalloc and tools that build on it, like stackimpact [7]


See also:


If you want to look at an already-running process, consider pyrasite(-shell)

The Garbage collector, gc

Speed notes

Depending on what exactly you do, most of the crunch-time CPU may be spent in C implementations of important functions rather than in the interpreter. This makes python fast enough for most purposes, and is one reason python is slow at some things yet faster than you expect at others.


Number crunching can often be moved to something like NumPy or PyGSL.

You can also easily pull in C libraries you have source to with SWIG, and even those you don't with ctypes - see Python extensions. You'll often need some wrapper code to make this code work a little more pythonically.

In this way, python can be expressive in itself as well as pull together optimized C code. Pure-python code, by contrast, carries more interpreter overhead than strictly necessary (which is the reason psyco could sometimes speed python up rather a lot).
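A minimal ctypes sketch, assuming a Unix-like system where the C library can be located (the fallback to CDLL(None) uses symbols already loaded into the process):

```python
import ctypes
import ctypes.util

# locate and load the C library (assumes a Unix-like system)
libc = ctypes.CDLL(ctypes.util.find_library('c') or None)

# declare the signature so ctypes converts arguments and results correctly
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

assert libc.strlen(b'hello') == 5
```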


TODO: read up on things like http://www.python.org/doc/essays/list2str.html


Text file line reading

Reading all lines from a file can be done in a few different ways:


  • readlines()
is a read() followed by splitting, so reads everything before returning anything
pro: less CPU than e.g. readline, because it's fewer operations
con: memory use is proportional to file size
con: slower to start (for large files)
note: you can tell it to read roughly some amount of bytes (the sizehint). You'd have to change your logic, though.
  • readlines() with sizehint
basically "read a chunk of roughly as many lines as fit in this size"
pro: usually avoids the memory issue
pro: somewhat faster than bare readline()
con: needs a bit more code


Lazily:

  • iterate over the file object
generally the cleanest way to go
  • iterate over iter(f.readline, '') (basically wrapping readline into a generator)
  • individual readline()s
generator style (pre-py2.3 there was an xreadlines(), which was deprecated in favour of this)


These three are functionally mostly the same.

Sometimes one of these variants is slightly nicer, e.g. the brevity of a for line in ... versus the more conditional control of individual readline() calls.
note: the EOF test varies:
the iterator handles it for you, whereas with readline() the test is whether len(line)==0,
because readline() leaves in the (possibly-translated) newline, so even a blank line is truthy


Debugging

See the IDE section; some offer debugging features.


Failing that, try pdb, the python debugger. (Maybe through Stani's Python Editor? Have never used that.)


To get information about an exception such as the stack trace - without actually letting the exception terminate things - use the traceback module. For more formatting, most people will instead want the cgitb module, which gives more useful information; it can be used for web server/browser output, but can also be set to output plain text.
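A minimal traceback sketch:

```python
import traceback

try:
    1 / 0
except ZeroDivisionError:
    # the full stack trace as a string, without letting the exception propagate
    details = traceback.format_exc()

assert 'ZeroDivisionError' in details
assert 'Traceback' in details
```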


See also pylint and PyChecker



Snippets

Detect OS and/or path style

Note that you do not need to know the OS to split and join paths elements correctly -- you can rely on os.path for that.

You can get the uname [8] fields or a good imitation:
os.uname()       returns a 5-tuple: (sysname, nodename, release, version, machine)
platform.uname() returns a 6-tuple: (sysname, nodename, release, version, machine, processor)

Note that the contents of these are relatively free-form. For example, for platform.uname():

('Linux',   'zeus',  '2.6.34-gentoo-r12', '#5 SMP Wed May 25 01:15:12 CEST 2011', 'i686', 'Pentium(R) Dual-Core CPU E5700 @ 3.00GHz')
('Windows', 'spork', '7',                 '6.1.7600',                             'x86',  'Intel64 Family 6 Model 23 Stepping 10, GenuineIntel')


You can detect the path style via the path separator (os.sep, which is the same object as os.path.sep, so either works) to figure out what style of paths we should be using, and as a hint of what OS we are on. Don't use this for path string logic -- you can do things safely using os.path functions.

if os.sep == '/':
    print("*nix-style paths")
elif os.sep == '\\':
    print("Windows-style paths")
else:
    print("Very Weird Things (tm)")

Note that Windows CE has a single root instead of drive letters, but still uses backslashes. It is hard to completely unify path logic because of such details.


Python under windows is slightly smart about programmers mixing \ and /. That is to say, mixes will work when python processes path logic itself (e.g. open(), not values passed verbatim into subprocess/popen/system calls).

Also, note that this is a(nother) reason that string equality is not a good test for path equality, and that you shouldn't do path splitting and such with your own string operations (look to the os.path functions).

There may be other details, so don't be lazy - use the path splitting and joining functions instead of appending a character yourself.


You can also inspect:

  • sys.platform (e.g. linux2, darwin, win32, cygwin, sunos5, and various openbsd and netbsd strings)
  • os.name (e.g. posix, nt, ce)
  • or even os.environ

Some library notes

3D

PyGame

Win32 interface

pywin32, previously known as win32all, provides hooks into various parts of windows. Apparently with central module win32api. (see also its help file)

downloadable from sourceforge and with a homepage here.


Some code for this:


GPGPU

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Reikna

Clyther

PyOpenCL


PyStream


cudamat

scikits.cuda


gnumpy

Theano


Unsorted

Creating ZIP files (in-memory)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

We can use in-memory data and StringIO/BytesIO objects here - so this is sometimes simpler than doing the same with tarfile.

You can add content with either:

  • ZipFile.write - takes filename it will open - useful when you just wanted more control over what files to add
  • ZipFile.writestr takes a filename/ZipInfo and a bytestring - the filesystem isn't touched, which is probably what you want when adding in-memory data to in-memory zip files.
# Just an example snippet (py3: io.BytesIO; py2 used (c)StringIO)
import io, zipfile

zip_sio = io.BytesIO()
z = zipfile.ZipFile(zip_sio, "w", zipfile.ZIP_DEFLATED)  # or another compression method
 
for filename, filedata in (('filename1.txt', b'foo'),
                           ('filename2.txt', b'bar')):
    z.writestr(filename, filedata)
z.close()
return zip_sio.getvalue()

FFTs

There are a few implementations/modules, including:

  • fftpack: used by numpy.fft, scipy.fftpack


Speed-wise: FFTW is fastest, numpy slower, scipy slower yet (not sure why the np/sp difference when they use the same code). Think a factor 2 or 3 (potentially), though small cases can be drowned in overhead anyway.

Also, FFTW planning matters.


It does matter how the coupling works - there for example are more and less direct (overhead-wise) ways of using FFTW.

TODO: figure out threading, MPI stuff.


See also:


Bytecode / resolve-related notes

  • .py - source text
  • .pyc - compiled bytecode
  • .pyo - compiled bytecode, optimized. Written when python is used with -O. The difference with pyc is currently usually negligible.
  • .pyd - a (windows) dll with some added conventions for importing
(and path-and-import wise, it acts exactly like the above, not as a linked library)
note: native code rather than bytecode


All of the above are searched for by python itself.

Python will generate pyc or pyo files

when modules are imported (not when they are run directly)
...with some exceptions: e.g. when importing from an egg or zip file, it will not alter those archives
...which means it can, for speed reasons, be preferable to distribute those with pyc/pyo files already in them
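You can also trigger that compilation explicitly via the stdlib py_compile module (compileall does the same for whole directory trees); a sketch, writing a throwaway module to a temp directory:

```python
import os
import py_compile
import tempfile

# write a tiny module and byte-compile it explicitly
src = os.path.join(tempfile.mkdtemp(), 'mymod.py')
with open(src, 'w') as f:
    f.write('X = 1\n')

pyc_path = py_compile.compile(src)   # returns the path of the written .pyc
assert os.path.exists(pyc_path)
assert pyc_path.endswith('.pyc')     # py3 puts it under __pycache__ with a version tag
```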


There is some talk about changing these, see e.g. PEP 488 (since accepted: py3.5 dropped .pyo files in favour of tagged .pyc names)


See also: