Python notes - semi-sorted

Various things have their own pages, see Category:Python. Some of the pages that collect various practical notes include:


Python3 notes

I still code for python2 (almost all system pythons are 2, often ~2.7 as of this writing), but it's starting to become a good idea to learn to code for both.


Two things you want to know about:

  • this is one useful summary of differences
  • 2to3 is a tool that parses code and shows suggested changes as a diff


Notes to self:

  • type-related stuff
    • int and long are now the same thing, so you can use int everywhere
    • In py2, 1/2 == 0 (integer division stays an int). In py3, / is always float division; when you want the old behaviour, use floor division,
      //
  • you can't mix tabs and spaces anymore. Probably a good thing.
  • py2 had both file() and open(). py3 has only open()
  • print is now a function (needs brackets)
adding brackets is usually perfectly backwards compatible with py2
if you use the comma trick to avoid a newline, that no longer works in py3. Its py3 replacement,
print(something, end='')
is not py2 compatible. You'll need to do some rewriting of your prints, or use sys.stdout.write() instead
if you can assume ≥py2.6 (verify) you can get py3's print behaviour with
from __future__ import print_function
  • exception syntax: use 'as' instead of a comma
except Exception as e
except (ValueError, TypeError) as e

instead of py2's:

except Exception, e
except (ValueError, TypeError), e
Using as has been supported since py2.6, and that's getting good enough since system python is now often 2.7
  • iterators
    • more things return an iterator or view, rather than a list (changes things if you did some type testing. Changes nothing if you just use them as sequences)
    • map() and filter() now return iterators rather than lists
    • py2 iterators had a .next() method; py3 renames it to __next__() and you call the next() built-in instead (available since py2.6)
  • strings and unicode:
in py2, '' means bytestring and u'' means unicode string
in py3, b'' means bytestring and '' means unicode string (and ≥py3.3 accepts u'' again)
When you use strings to print text, things will mostly work as-is.
When you are explicitly handling conversions between these things, you'll need to rewrite that
When you need to support both py2 and py3
You can get py3 behaviour in ≥py2.6 via
from __future__ import unicode_literals
(verify)
in other cases you may wish to cheat, e.g. instantiating all strings via a function that does some conversion based on the python version
TODO: figure out how to best deal with libraries
  • buffer, memoryview and such changed.
largely but not fully compatible. If you use them, you'll need to read up.
  • py3 has new-style classes only (that is, inherit from object)
  • has_key is gone. The in keyword should handle all cases.
  • exec is now a function (needs brackets)
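As a sketch of what the above means in practice, here is a small fragment that runs unchanged under both py2 (≥2.6) and py3 (the names are just illustrative):

```python
from __future__ import print_function, division   # py3-style print() and / in py2

def halve(n):
    # with the division import, / is float division under both versions;
    # // is floor division under both
    return n // 2, n / 2

try:
    int('not a number')
except (ValueError, TypeError) as e:   # 'as' syntax: valid in py2.6+ and py3
    print('caught: %s' % e)

print(halve(1))    # (0, 0.5) under both py2 and py3
```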

See also:

Setting the process name

It's OS-specific stuff, so there is no short portable way.

The easiest method is to install and use the setproctitle module - it aims to be portable and tries its best on various platforms. Example use:

import os, sys
try:
    import setproctitle
    setproctitle.setproctitle( os.path.basename(sys.argv[0]) )
except ImportError:
    pass

The above often means 'use the filesystem name that was used to run this' - but not always, so a hardcoded string can make sense.

Useful links

If you are new to Python, common suggestions for learning it include Dive Into Python, The python flavour of How to Think Like a Computer Scientist, or Thinking In Python.





Libraries and documentation



Semi-sorted:


Additional notes you may wish to have seen some time

array, deque

A list can be used as a sort of deque, like:

append(val)     # insert on right side 
pop()           # take from right side
insert(0,val)   # insert on left side
pop(0)          # take from left side

However, list is primarily efficient for stack-like use - list.pop(0) and list.insert(0, val) are O(n) operations,

...while those are O(1) operations on collections.deque (added in 2.4). deque also has appendleft(), extendleft(), and popleft(), and some others that can be convenient and/or more readable.
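For example (a minimal sketch):

```python
from collections import deque

d = deque([1, 2])
d.appendleft(0)        # O(1), where list.insert(0, val) would be O(n)
leftmost = d.popleft() # also O(1)
print(leftmost)        # 0
print(list(d))         # [1, 2]

# a bounded deque silently discards from the opposite end - handy for
# 'keep the last N items' buffers
last3 = deque(maxlen=3)
for i in range(10):
    last3.append(i)
print(list(last3))     # [7, 8, 9]
```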

You may also wish to know about the Queue module (renamed queue in py3), a multi-producer, multi-consumer, thread-safe queue.

See also:

StringIO

You can write to StringIO objects, and ask them for the data they caught so far. They store this data only in memory.

It is mostly useful where a function wants to write to a file object, but you want to avoid the filesystem (for convenience, to avoid potential permission problems, or for speed by avoiding IO).

Two caveats:

  • you can write(), but reading back is limited -- getvalue() returns the full contents so far regardless of the current position (plain StringIO does let you seek(0) and then read())
  • Once the StringIO object is close()d, the contents are gone
not usually a problem, as most save functions either take a filename and do an open()-write()-close() themselves (in which case StringIO is fairly irrelevant), or take a file object and just write() (in which case you're fine)
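A minimal sketch (using py2's module; in py3 the equivalent lives in io, as io.StringIO for text and io.BytesIO for bytes):

```python
try:
    from cStringIO import StringIO   # py2's faster C version
except ImportError:
    from io import StringIO          # py3: text; use io.BytesIO for bytes

sio = StringIO()
sio.write('hello ')
sio.write('world')
contents = sio.getvalue()   # full contents so far, regardless of current position
print(contents)             # hello world

sio.seek(0)                 # plain StringIO also allows rewinding and read()ing
print(sio.read())           # hello world

sio.close()                 # after this, the contents are gone
```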


cStringIO is the faster, written-in-C drop-in. It is often useful to do:

try: # use the faster extension when we can
    import cStringIO as StringIO
except ImportError: # fall back to python's own when we must
    import StringIO

See also:

Date stuff

Speed, memory, debugging

Profiling

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

One easy way is to use cProfile.py, which I like in a wrapper script like:

#!/bin/bash
python -m cProfile -s time "$@"

...or -o to save it to a file, so that you can inspect it more flexibly.
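If you did save the profile with -o, you can inspect it from python with the pstats module. A self-contained sketch (the profiled expression and temp file are just placeholders):

```python
import cProfile, pstats, tempfile

# profile something into a file - the CLI equivalent of
#   python -m cProfile -o out.prof yourscript.py
profpath = tempfile.mktemp(suffix='.prof')
cProfile.run('sum(x*x for x in range(100000))', profpath)

stats = pstats.Stats(profpath)
stats.strip_dirs()                 # drop long directory prefixes from the report
stats.sort_stats('time', 'calls')  # same keys as -s on the command line
stats.print_stats(10)              # show the top 10 entries
```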


For a little more control, you could add e.g. hotshot to your __main__ code. I've taken up the habit of putting it in a function like:

def profile():
    import hotshot
    prof = hotshot.Profile("hotprof") #file to write the profile to
    prof.runcall( main ) # or a wrapper function containing the things you want profiled
    prof.close()


...and use something like RunSnakeRun, a graphical profile viewer. In text mode, you'd probably want a simple script to view the results instead, for example:

import hotshot.stats
stats = hotshot.stats.load("hotprof")
stats.strip_dirs()
stats.sort_stats('time', 'calls')
stats.print_stats(20)

Memory-related notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Memory profilers (/leak helpers)

Include:

  • dowser [1] (gives graphical statistics - uses CherryPy, PIL)
  • heapy [2]
  • pysizer [3] (can use pyrex if available)
  • Python Memory Validator [4] (commercial)

See also:


The Garbage collector, gc

Speed notes

Depending on what exactly you do, most of the crunch-time CPU may be spent in C implementations of important functions rather than in the interpreter, making python fast enough for most purposes. This is one reason python is slow at some things yet faster than you might expect at others.


Number crunching can often be moved to something like NumPy or PyGSL.

You can also easily pull in C libraries you have source to with SWIG, and even those you don't with ctypes - see Python extensions. You'll often need some wrapper code to make this code work a little more pythonically.

In this way, python can be expressive in itself as well as pull together optimized C code, though the glue can add more overhead than strictly necessary (which is also a reason psyco can sometimes speed python up rather a lot).


TODO: read up on things like http://www.python.org/doc/essays/list2str.html


Text file line reading

Reading all lines from a file can be done in a few different ways:


  • readlines()
is a read() followed by splitting, so reads everything before returning anything
pro: less CPU than repeated readline(), because it's fewer operations
con: memory use is proportional to file size
con: slower to start (for large files)
  • readlines() with sizehint
basically "read a chunk of roughly as many lines as fit in this size"
pro: usually avoids the memory issue
pro: somewhat faster than bare readline()
con: needs a bit more code

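The sizehint variant then looks something like the following (the temp file is only there to make the sketch self-contained):

```python
import tempfile

# build a small example file to read back
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write('line\n' * 10000)
tmp.close()

count = 0
f = open(tmp.name)
while True:
    lines = f.readlines(64 * 1024)   # sizehint: roughly 64KB worth of lines per call
    if not lines:                    # an empty list means EOF
        break
    count += len(lines)
f.close()
print(count)                         # 10000
```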

Lazily:

  • iterate over the file object
generally the cleanest way to go (pre-py2.3 there was also xreadlines(), deprecated in favour of this)
  • iterate over iter(f.readline, '') (basically wrapping readline into a generator)
  • individual readline() calls


These three are functionally mostly the same.

Sometimes one of these variants is slightly nicer, e.g. the brevity of a for line in ... versus the more conditional control of individual readline() calls.
note: the EOF test varies:
the iterator handles it for you, whereas with readline() you test len(line)==0 yourself,
because readline() leaves in the (possibly-translated) newline, so only EOF returns an empty string
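To make the EOF difference concrete (the temp file is only there for self-containedness):

```python
import tempfile

tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
tmp.write('a\nb\nc\n')
tmp.close()

# cleanest: iterate the file object - EOF is handled for you
for line in open(tmp.name):
    pass

# readline() style: EOF must be tested by hand. An empty string means EOF,
# because a blank line in the file still contains its '\n'
f = open(tmp.name)
lines = []
while True:
    line = f.readline()
    if len(line) == 0:   # EOF; '\n' alone would just be an empty line
        break
    lines.append(line)
f.close()
print(lines)             # ['a\n', 'b\n', 'c\n']
```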



Debugging

See the IDE section; some offer debugging features.


Failing that, try pdb, the python debugger. (Maybe through Stani's Python Editor? Have never used that.)


To get information about an exception - such as the stack trace - without actually letting the exception terminate things, use the traceback module. Most people will instead want the extra formatting of the cgitb module, which gives more useful information; it can be used for web server/browser output, but can also be set to output plain text.


See also pylint and PyChecker



Snippets

chardet as a utility

Something like:

#!/usr/bin/python
import sys,os
try:
    import chardet
except ImportError:
    print "You do not have the 'chardet' module."
    sys.exit(-1)
 
# ~8K is a quick but often decent guess
# ~50K is noticeably slower (on non-ascii), but is less likely to miss stuff.
# ~100K or more can make the confidence more informative
readsize = 25*1024
 
if len(sys.argv)<2:
    print "need some filenames to work on"
else:
    for fn in sys.argv[1:]:
       if os.path.isfile(fn):
           print '%-30s'%fn,
           f = open(fn, 'rb')
           d = chardet.detect(f.read(readsize))
           f.close()
 
           print '  %-12s  '%d['encoding'],
           print '  (confidence: %.1f%%)'%(100.*d['confidence'])

...in some place like /usr/local/bin/chardet

It may be useful to combine it with python-magic, or a subprocess.Popen to file -b --mime-type filename or such, to only do this on text types (or to imitate and extend the functionality of file)

Strings

HTML entity decoder

I noticed another bug; will paste in the corrected version sometime




Remove diacritics from text

import re, unicodedata

regex_combining = re.compile(u'[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]',re.U)
 
def remove_diacritics(s):
    """ Decomposes string, then removes combining characters.
        Hand this a unicode string, not an encoded one
    """
    #TODO: Figure out whether the NFC is unnecessary
    return unicodedata.normalize('NFC', 
               regex_combining.sub('',unicodedata.normalize('NFD', unicode(s))) 
    )

Process memory use

Python doesn't seem to directly report what it's up to (and details are sort of relevant since it allocates in chunks).


In *nix OSes you can use proc, for example like:

import os,re
reMemFigure = re.compile('\s*([0-9]+)\s*([kmgt])?b',re.I)
units=['','k','m','g','t']
 
def proc_process_status(pid=None):
    """ Returns process statistics, most interestingly memory usage, in a dict.
        Converts summarized memory sizes to byte amounts for ease of consumption.
        Reports the python process calling the process unless another pid is given.
        (...and you're probably interested in VmRSS)
    """
    if pid==None:
        pid=os.getpid()
    f=open('/proc/%d/status'%pid)
    lines=f.readlines()
    f.close()
    ret={}
    for line in lines:
        a=line.rstrip('\r\n').split('\t')   #strip: the line's newline
        key=a[0].rstrip(':')
        value='\t'.join(a[1:]) #in case there are tabs in the value, restore them
        figure = reMemFigure.search(value)
        if figure!=None:      #is it a K/M/G figure? Convert to bytes
            num,kmgt=figure.groups()
            num=int(num,10)
            for uniti in range(len(units)):
                if kmgt==units[uniti]:
                    num *= 1024**uniti
            ret[key]=num
        else: # leave as whatever string it was
            ret[key]=value
    return ret

Now, pprint.pprint(proc_process_status()) will print something like:

{'CapBnd': 'ffffffffffffffff',
 'CapEff': '0000000000000000',
 'CapInh': '0000000000000000',
 'CapPrm': '0000000000000000',
 'Cpus_allowed': '0f',
 'Cpus_allowed_list': '0-3',
 'FDSize': '256',
 'Gid': '1004\t1004\t1004\t1004',
 'Groups': '1004 1007 ',
 'Mems_allowed': '1',
 'Mems_allowed_list': '0',
 'Name': 'python2.6',
 'PPid': '5217',
 'Pid': '5224',
 'ShdPnd': '0000000000000000',
 'SigBlk': '0000000000000000',
 'SigCgt': '0000000180000002',
 'SigIgn': '0000000001001000',
 'SigPnd': '0000000000000000',
 'SigQ': '0/147456',
 'State': 'R (running)',
 'Tgid': '5224',
 'Threads': '1',
 'TracerPid': '0',
 'Uid': '1000\t1000\t1000\t1000',
 'VmData': 1581056,
 'VmExe': 4096,
 'VmHWM': 3633152,
 'VmLck': 0,
 'VmLib': 3272704,
 'VmPTE': 20480,
 'VmPeak': 6639616,
 'VmRSS': 3633152,
 'VmSize': 6639616,
 'VmStk': 98304,
 'nonvoluntary_ctxt_switches': '1004',
 'voluntary_ctxt_switches': '63'}

And yes, of course this is a somewhat fragile hack.


On windows (and cpython) you can use win32com and ctypes to ask windows; see e.g. [5] (one of the comments)



You can sum up the sizes of python objects since 2.6:

def object_memory():
    """ Sum of the size of gc-tracked objects. 
        Can be useful to summarize, say, the before-and-after difference
         of loading in data, building indices, and such.
 
        Does NOT reflect the amount of process memory, because it does not count things like:
          the interpreter,(~3 MB+?), 
          interpreter overhead,
          libraries, 
          data stored in C extensions (think numpy and such), 
          shared memory
        ...most of which can be anywhere from next to nothing up to a _lot_.
 
        sys.getsizeof() is >=py2.6  (though there are imitations of it for earlier versions)
    """
    import gc,sys
    return sum(sys.getsizeof(o)  for o in gc.get_objects())


See also:

Mail

This is probably only useful to copy code fragments out of. A helper function I have for assembling an email message with specified text encodings, and attachments:


def mail(to_address, from_address, main_message, subject='', attachments=None, try_server='localhost', try_sendmail=True):
    ''' 
       main_message should be a unicode string, or a UTF-8 encoded bytestring.
 
       If you specify attachments, hand in a sequence of (filename,filemime,filedata) triples
 
       try_server should be either None/False, or a server address (e.g. localhost)
       try_sendmail can be True or False
    '''
    import os, smtplib, email, email.MIMEMultipart, email.MIMEText, email.MIMEBase, email.Utils, email.Encoders
 
    ### Compose the message string
    msg = email.MIMEMultipart.MIMEMultipart() #overall message container
    # The basic headers (you may want more than these):
    msg['From']      = from_address
    msg['Return-to'] = from_address
    msg['To']        = to_address
    msg['Date']      = email.Utils.formatdate(localtime=True)
    msg['Subject']   = subject
 
    if type(main_message) is unicode:
         main_message = main_message.encode('utf8')
    msg.attach( email.MIMEText.MIMEText( main_message, _charset='utf-8' ) ) # (UTF-8 text/plain)
 
    # attach files, if any
    if attachments:
        for filename,filemime,filedata in attachments:
             typ,subtyp=filemime.split('/') #the MIME type is handed in in two parts
             part = email.MIMEBase.MIMEBase(typ,subtyp)
             part.set_payload( filedata )
             email.Encoders.encode_base64(part)
             # set the filename for the attachment (you may want to add some protection)
             part.add_header( 'Content-Disposition', 
                              'attachment; filename="%s"'%os.path.basename(filename))
             msg.attach(part) #and add it to the message
 
    msg_str = msg.as_string() # the message flattened into text
 
 
    ### Now try sending it
    # In general, you can try to do this via an SMTP server (which can be contacted with Python modules).
    # On *nix systems, you may also want to offer the alternative of piping to sendmail 
    #  (itself regularly actually a receptor script from the MTA you have installed). 
    # The below tries to guess whether it was successful
    sent=False
    if try_server:
        import socket
        try: # See if there is a mail server locally
            smtp = smtplib.SMTP(try_server)
            smtp.sendmail( from_address, to_address, msg.as_string() )
            smtp.close()
            sent=True
        except socket.error: #probably a 'rejected' because there was no mail server
            pass
    if not sent and try_sendmail:
        try:
            import popen2 # TODO: use subprocess
            (stdouterr,stdin) = popen2.popen4("sendmail -t -v")
            # -t means 'take recipients from message I give you instead of from the command'
            stdin.write(msg.as_string())
            stdin.close()
            output=stdouterr.read() #in case you want to log what it said (not used here)
            stdouterr.close()
            sent=True
        except ImportError: # no popen2? (I still need to look at the popen variants)
            pass
        except IOError: # Some error?
            pass
    return sent

Note that various environments, including web frameworks, may remove paths from the environment, which may mean you have to hardcode an absolute path into the popen() call, such as /usr/sbin/sendmail.


Detect OS and/or path style

Note that you do not need to know the OS to split and join path elements correctly -- you can rely on os.path for that.

You can get the uname [6] fields, or a good imitation:

os.uname       returns a 5-tuple: (sysname, nodename, release, version, machine)
platform.uname returns a 6-tuple: (sysname, nodename, release, version, machine, processor)

Note that the contents of these fields are relatively free-form. For example, example outputs for platform.uname:

('Linux',   'zeus',  '2.6.34-gentoo-r12', '#5 SMP Wed May 25 01:15:12 CEST 2011', 'i686', 'Pentium(R) Dual-Core CPU E5700 @ 3.00GHz')
('Windows', 'spork', '7',                 '6.1.7600',                             'x86',  'Intel64 Family 6 Model 23 Stepping 10, GenuineIntel')


You can detect the path style via the path separator (os.sep and os.path.sep refer to the same value, so use either) to figure out what style of paths we should be using, and as a hint of what OS we are on. Don't use this for path string logic -- you can do things safely using os.path functions.

if os.sep=='/':
    print "*nix-style paths"
elif os.path.sep=='\\':
    print "Windows-style paths"
else:
    print "Very Weird Things (tm)"

Note that Windows CE has a single root instead of drive letters, but still uses backslashes. It is hard to completely unify path logic because of such details.


Python under windows is slightly smart about programmers mixing \ and /. That is to say, mixes will work when python processes path logic itself (e.g. open(), not values passed verbatim into subprocess/popen/system calls).

Also, note that this is a(nother) reason that string equality is not a good test for path equality, and that you shouldn't do path splitting and such with your own string operations (look to the os.path functions).

There may be other details, so don't be lazy - use the path splitting and joining functions instead of appending a character yourself.


You can also inspect:

  • sys.platform (e.g. linux2, darwin, win32, cygwin, sunos5, and various openbsd and netbsd strings)
  • os.name (e.g. posix, nt, ce)
  • or even os.environ
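A small sketch of the above - path logic via os.path, platform hints via os.name and sys.platform (the path pieces are just illustrative):

```python
import os, os.path, sys

# rely on os.path for path logic rather than testing the separator yourself
p = os.path.join('data', 'logs', 'app.log')   # uses the right separator for this OS
head, tail = os.path.split(p)
print(tail)                                   # app.log, on any OS

# platform hints, roughly from coarse to specific
print(os.name)        # 'posix', 'nt', 'ce', ...
print(sys.platform)   # 'linux2'/'linux', 'win32', 'darwin', 'cygwin', ...
```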

Unfinished

File follow

Unfinished, untested. Abandoned because subprocess.Popen()ing tail -F still turned out to be more convenient for the case this was written for.

import os, time

def follow_by_name(filename, interval=0.5):
      ''' Returns a generator that keeps on yielding lines
          whenever the file changes.
          Watches filename (so is an imitation of tail -F, not tail -f)
      '''
      sl= os.stat(filename)
      prev_inode = sl[1]
      prev_pos   = 0
 
      while True: #open/reopen
            print 'Opening %s'%filename
            f = open(filename)
 
            replaced=False
            while not replaced:
                  time.sleep(interval)
                  #print "..."
                  #Check for content change (well, size change)
                  back_to = f.tell() # Remember where we are right now
                  f.seek(0,2) # Check size by seeing where end of file is
                  cur_pos = f.tell()
                  f.seek(back_to,0) # leave position as it was before check
                  if cur_pos!=prev_pos:
                        line = f.readline()
                        yield line # assumes exactly one line was appended
                        prev_pos=cur_pos
 
                  # check whether the file was replaced (by inode)
                  cur_sl = os.stat(filename)
                  cur_inode = cur_sl[1]
                  if cur_inode!=prev_inode:
                        replaced=True
                        prev_inode=cur_inode
            f.close()

Unsorted


http://wiki.python.org/moin/ParallelProcessing


http://www.limsi.fr/Individu/pointal/python/pqrc/versions/PQRC-2.4-A4-latest.pdf

path.py does some os path wrapping that you may like.


To use os.chown with user and group names, use pwd.getpwnam and grp.getgrnam.
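For example (a sketch; assumes a *nix system where the named user and group exist):

```python
import os, pwd, grp

def chown_by_name(path, username, groupname):
    # os.chown() wants numeric ids, so look them up first
    uid = pwd.getpwnam(username).pw_uid
    gid = grp.getgrnam(groupname).gr_gid
    os.chown(path, uid, gid)

# the lookups on their own:
print(pwd.getpwnam('root').pw_uid)   # 0 on typical *nix systems
```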


Mechanical browser

Twill looks neat, as it seems able to act like a real browser (supporting forms, cookies, and using the parsed page) from a simple interactive shell or from python code (although you do need to silence it, e.g. with a redirect into a StringIO object).

There is also a py-mechanize



dnslib

https://pypi.python.org/pypi/dnslib

Many examples seem to be about creating DNS responses, dealing with zone information, and such.


My own use was parsing DNS sniffed from the network.

The easiest class to use in that case is DNSRecord, which will contain a DNSHeader, and one or more DNSQuestion or DNSRR (answer) records.

That does mean more conditions, because I catch both queries and responses.

Destructors

blaze

Think numpy, but for big data: it's designed to deal with data larger than system memory, and with distributed and streaming data.

http://blaze.pydata.org/docs/index.html


pandas

Data analysis, modelling. Think R-like things within python

http://pandas.pydata.org/


pylint notes

Some library notes

3D

PyGame

Win32 interface

pywin32, previously known as win32all, provides hooks into various parts of windows. Apparently with central module win32api. (see also its help file)

downloadable from sourceforge and with a homepage here.


Some code for this:


GPGPU

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Reikna

Clyther

PyOpenCL


PyStream


cudamat

scikits.cuda


gnumpy

Theano


Unsorted

Theano notes

http://deeplearning.net/software/theano/install_ubuntu.html

http://deeplearning.net/software/theano/install.html#gpu-linux

Unsorted

Creating ZIP files (in-memory)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

We can directly use in-memory data and StringIO objects here (so this is simpler than doing the same with tarfile).

You can add content with either:

  • ZipFile.write - takes a filename it will open - useful when you just want more control over which files to add
  • ZipFile.writestr takes a filename/ZipInfo and a bytestring - the filesystem isn't touched, which is probably what you want when adding in-memory data to in-memory zip files.
# Just an example snippet
zip_sio=StringIO.StringIO()
z = zipfile.ZipFile(zip_sio, "w", zipfile.ZIP_DEFLATED) # or another compression level
 
for filename,filedata in (('filename1.txt', 'foo'),
                          ('filename2.txt', 'bar')):
   z.writestr(filename,filedata)
z.close()
return zip_sio.getvalue()
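Reading such an in-memory zip back works the same way. A sketch (note that in py3 the buffer must hold bytes, i.e. io.BytesIO):

```python
import zipfile
try:
    from cStringIO import StringIO as BytesIO   # py2
except ImportError:
    from io import BytesIO                      # py3: zip data is bytes

buf = BytesIO()
z = zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED)
z.writestr('hello.txt', b'in-memory zip contents')
z.close()

# hand the raw bytes back to ZipFile, still without touching the filesystem
z2 = zipfile.ZipFile(BytesIO(buf.getvalue()))
print(z2.namelist())                 # ['hello.txt']
data = z2.read('hello.txt')
z2.close()
```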


Ipython

iPython is a collection of:

  • an interactive shell, more featured than python's own.
http://ipython.org/ipython-doc/rel-0.12/interactive/tutorial.html
  • integrates with some interactive data visualization
  • integrates with GUI toolkits
...both of which are used in...
  • notebooks - served via browser, allows embedded code, text, plots, mathematical expressions
  • tools for parallel computing (due to itself being abstracted out this way)
  • makes it easier to embed an interpreter into your own project


See also:



I like the way you can hook profiling into ipython, via its magic functions:

  •  %time - how much time (one run)
  •  %timeit - how much time (in a bunch of runs, at least a second's worth?(verify))
  •  %prun - how much time, per function
  •  %lprun - how much time, per line
  •  %mprun, %memit - how much memory per function (once, a bunch)

See also:

notebook

Functionally, the iPython notebook is a web browser based interactive shell, integrating some plotting, math, and such. (more technically)


To see what people have done with it, see e.g. https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#table-of-contents

As of IPython 4.0, the notebook part of the project was separated off into Jupyter


To use

The actual notebook gets stored in the current directory (so you may wish to organize notebooks into directories a bit)

Anyway:

ipython notebook

starts a backend and launches a browser that looks at it.

You then probably want to look at Help → Keyboard Shortcuts. Most important to start with is probably Shift-Enter: run, then go to next cell



Could not open file <notebook> for safe execution.

An error meaning you have a very old version of ipython. Currentish versions are at least 2.something, while notebook was added in 0.12

In my case the reason was that I had not installed ipython, but installed software had an old version placed in the PATH.

FFTs

There are a few implementations/modules, including:

  • fftpack: used by numpy.fft, scipy.fftpack


Speed-wise: FFTW is fastest, numpy is slower, and scipy slower yet (not sure why the np/sp difference, given they use the same code). Think a factor of 2 or 3 (potentially), though small cases can be drowned in overhead anyway.

Also, FFTW planning matters.


It does matter how the coupling works - there are, for example, more and less direct (overhead-wise) ways of using FFTW.

TODO: figure out threading, MPI stuff.


See also: