Python notes - speed, memory, debugging, profiling

From Helpful

Profiling

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


time.time
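A minimal wall-clock timing sketch (note that time.perf_counter() generally has better resolution for measuring intervals):

```python
import time

start = time.time()
total = sum(range(1_000_000))   # some work to measure
elapsed = time.time() - start

print(f"summed in {elapsed:.6f} s")
```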

timeit
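A quick sketch of timing a small expression with timeit; the number parameter controls how many times the statement is run:

```python
import timeit

# total time for 10,000 executions of the statement
t = timeit.timeit("sum(range(100))", number=10_000)
print(f"{t:.4f} s total, {t / 10_000:.2e} s per call")
```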

profile and cProfile


Deterministic profiling

profile is pure Python:

python -m profile file.py

cProfile is the same interface with a C implementation, which avoids some of that overhead; use it when running on CPython:

python -m cProfile file.py


The default is text output. You often want -o to store the profile data in a file you can inspect later:

python -m cProfile -o out.prof file.py


Viewing profile data

pstats can parse such data, for programmatic consumption or simple text output
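A self-contained sketch: generate some profile data with cProfile, dump it to a file (as python -m cProfile -o would), then parse and summarize it with pstats:

```python
import cProfile
import os
import pstats
import tempfile

def busy():
    return sum(i * i for i in range(200_000))

prof = cProfile.Profile()
prof.enable()
busy()
prof.disable()

# dump to a file, like `python -m cProfile -o out.prof script.py` would
path = os.path.join(tempfile.mkdtemp(), "out.prof")
prof.dump_stats(path)

# pstats parses that file; strip_dirs() shortens paths,
# then sort by cumulative time and show the top entries
stats = pstats.Stats(path)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
```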


You may prefer a graphical profile viewer, or one of the alternative profilers:

hotshot (Python 2 only; it was removed in Python 3)

yappi

vmprof

stacksampler

Statistical profiler.


pyinstrument

pyflame


pyflame is a ptrace-based sampling profiler; it can attach to arbitrary running Python programs, so no instrumentation is necessary.

Its output is a simple text format, made to be consumed by flamegraph.pl or similar.


Memory-related notes


Memory profilers (/leak helpers)

Include:

  • dowser (gives graphical statistics; uses CherryPy and PIL)
  • heapy (part of guppy)
  • pysizer (can use Pyrex if available)
  • Python Memory Validator (commercial)

Since Python 3.4 there is also tracemalloc in the standard library, and tools that build on it, like stackimpact
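A minimal tracemalloc sketch: start tracing, allocate something noticeable, then look at totals and the top allocation sites:

```python
import tracemalloc

tracemalloc.start()

# allocate something noticeable
data = [bytes(1000) for _ in range(1000)]

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current} B, peak: {peak} B")

# top allocation sites, grouped by source line
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```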




If you want to look at an already-running process, consider pyrasite(-shell)

The Garbage collector, gc
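A small sketch of what gc adds over plain reference counting: reclaiming reference cycles, which refcounting alone never frees:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# build a reference cycle, then drop the only external references
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b

# the cyclic garbage is found by the collector, not by refcounting
unreachable = gc.collect()
print(f"collected {unreachable} unreachable objects")
```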

Speed notes

Depending on what exactly you do, most of the crunch-time CPU may be spent in C implementations of important functions rather than in the interpreter. That makes Python fast enough for most purposes, and is one reason it is slow at some things yet faster than you might expect at others.


Number crunching can often be moved to something like NumPy or PyGSL.

You can also easily pull in C libraries you have source to with SWIG, and even those you don't with ctypes - see Python extensions. You'll often need some wrapper code to make this code work a little more pythonically.

In this way, Python can be expressive in itself as well as pull together optimized C code, though the glue around such calls can carry more overhead than strictly necessary (part of why the now-defunct psyco could sometimes speed Python up rather a lot).


TODO: read up on things like http://www.python.org/doc/essays/list2str.html
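The list2str essay is about building strings with repeated += versus str.join. A rough timing sketch of the two (note that modern CPython optimizes in-place str concatenation in some cases, so measure rather than assume):

```python
import timeit

def concat(n=1000):
    # repeated += on a string
    s = ""
    for _ in range(n):
        s += "x"
    return s

def join(n=1000):
    # build via a generator and join once
    return "".join("x" for _ in range(n))

print("+=  :", timeit.timeit(concat, number=500))
print("join:", timeit.timeit(join, number=500))
```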


Text file line reading

Reading all lines from a file can be done in a few different ways:


  • readlines()
is a read() followed by splitting, so it reads everything before returning anything
pro: less CPU than e.g. readline(), because it's fewer operations
con: memory use is proportional to file size
con: slower to start (for large files)
note: you can tell it to read roughly some amount of bytes; you'd have to change your logic, though
  • readlines() with sizehint
basically "read a chunk of roughly as many lines as fit in this size"
pro: usually avoids the memory issue
pro: somewhat faster than bare readline()
con: needs a bit more code


Lazily:

  • iterate over the file object
generally the cleanest way to go
(pre-py2.3 there was an xreadlines(), which was deprecated in favour of this)
  • iterate over iter(f.readline, '') (basically wrapping readline in a generator)
  • individual readline() calls


These three are functionally mostly the same.

Sometimes one variant is slightly nicer, e.g. the brevity of a for line in f: loop versus the more conditional control of individual readline() calls.

Note that the EOF test varies: the iterator handles it for you, whereas with readline() you test len(line)==0, because a returned line keeps its (possibly-translated) newline and so is never empty before EOF.
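A sketch of the lazy variants and their differing EOF tests, using a small temporary file (the filename and contents are just illustrative):

```python
import os
import tempfile

# write a small example file
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("one\ntwo\nthree\n")

# lazy iteration over the file object: EOF is handled for you
with open(path) as f:
    lines_iter = [line.rstrip("\n") for line in f]

# explicit readline(): EOF is the empty string, since '\n' is kept
lines_rl = []
with open(path) as f:
    while True:
        line = f.readline()
        if len(line) == 0:   # EOF
            break
        lines_rl.append(line.rstrip("\n"))

print(lines_iter, lines_rl)
```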



Debugging


See the IDE section; some offer debugging features.


Failing that, try pdb, the python debugger. (Maybe through Stani's Python Editor? Have never used that.)


To get information about an exception, such as the stack trace, without actually letting the exception terminate things, use the traceback module. Many people will instead want the extra formatting of the cgitb module, which gives more useful information, can render for web server/browser output, and can also be set to output plain text. (Note that cgitb was deprecated in Python 3.11 and removed in 3.13.)
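A minimal sketch of capturing a trace with traceback rather than letting the exception propagate:

```python
import traceback

try:
    1 / 0
except ZeroDivisionError:
    # capture the formatted trace as a string instead of printing and dying
    tb_text = traceback.format_exc()

print(tb_text)
```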


See also pylint and PyChecker