Python usage notes/joblib


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Joblib is a way to

  • serialize jobs,
  • execute them on demand,
  • execute them in parallel, and
  • memoize results to avoid double work.

There is also some file-memmapping functionality (mostly to share read-only data) and some shared-memory functionality.

It is also sometimes used only as a serialization format.


It has a few options that let you bolt it onto existing code and, relatedly, has its own way of doing things.


It's numpy-aware - and should deal okayish with large arrays, compressing them where that's easy.


Memoization

joblib.Memory is disk-backed memoization, which you could

  • decorate functions with
  • apply more explicitly, to do the occasional checkpoint
  • wrap into every Parallel call
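A minimal sketch of the first two uses, assuming a throwaway cache directory. The function names, the `calls` list and the directory are made up for illustration and are not from joblib itself.

```python
import tempfile
from joblib import Memory

# Hypothetical cache location; any writable directory works.
cache_dir = tempfile.mkdtemp(prefix="joblib_cache_")
memory = Memory(cache_dir, verbose=0)

calls = []  # records real executions, to show when the cache is hit


@memory.cache
def slow_square(x):
    """Stand-in for an expensive computation."""
    calls.append(x)
    return x * x


first = slow_square(4)   # computed, result stored on disk
second = slow_square(4)  # same argument: read back from the cache
third = slow_square(5)   # new argument: computed again


# The more explicit form: wrap an existing function without a decorator.
def slow_cube(x):
    return x ** 3


cached_cube = memory.cache(slow_cube)
cube = cached_cube(3)
```

Because the cache lives on disk, it also survives between runs of the script, which is what makes it usable as a checkpoint.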


Parallel execution

joblib.Parallel uses one of several backends:

  • threading: lower overhead, but not always faster (consider the GIL)
  • multiprocessing: more overhead, but can be safer
  • loky (the default): like multiprocessing, with a few extra nice details
(also capable of threading)
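As a rough sketch of the choice between processes and threads: the same work can be run either way by passing a hint to Parallel. The workload here (sqrt on small integers) is only a placeholder and far too cheap to benefit from either.

```python
from math import sqrt
from joblib import Parallel, delayed

numbers = range(10)

# Process-based workers (the default): sidesteps the GIL, at the cost of
# starting workers and pickling arguments and results.
with_processes = Parallel(n_jobs=2)(delayed(sqrt)(n) for n in numbers)

# Thread-based workers: cheaper to start and no pickling, but pure-Python
# code still contends for the GIL.
with_threads = Parallel(n_jobs=2, prefer="threads")(
    delayed(sqrt)(n) for n in numbers
)

# Results come back in input order in both cases.
same = with_processes == with_threads
```

Threads tend to pay off when the function spends its time in code that releases the GIL (I/O, much of numpy). Otherwise processes are usually the safer default.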


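joblib.dump() and joblib.load() help serialize numpy data. They are built on pickle, so they handle the same kinds of objects; the gain is that large arrays are stored more efficiently, and can optionally be compressed.

This is however not a portable format: like any pickle, a file depends on the code and library versions it is loaded with, and joblib does not fully support loading files across Python versions. (The cloudpickle that joblib uses to send functions to worker processes is stricter still: it is only guaranteed to work in the exact same version of Python, not between them.) (verify)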
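A small round-trip sketch, assuming numpy is installed. The file name, the dict contents and the compression level are arbitrary choices for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# Any picklable object works; the large array is where joblib helps.
data = {"weights": np.arange(1_000_000, dtype=np.float64), "label": "example"}

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.joblib")

dump(data, path, compress=3)  # compress ranges from 0 (off) to 9
restored = load(path)

arrays_match = np.array_equal(restored["weights"], data["weights"])
```

Load such files only from sources you trust, since they are pickles underneath and unpickling can execute arbitrary code.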



delayed() is basically a cleanish way to pass in the function and its arguments to Parallel, without accidentally calling it as a function and doing that work in the main interpreter.

For example, the example from [2] is trying to parallelize

[sqrt(i ** 2)  for i in range(10)]

which works out as something like

Parallel( n_jobs=2 )( delayed(sqrt)(i ** 2)  for i in range(10) )

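(Note that i**2 is still computed in the main process: the argument expression is evaluated before delayed(sqrt) receives it, so only the sqrt call itself is sent to the workers.)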
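A runnable version of that example, with the imports it needs, plus a look at what delayed() actually produces. The helper root_of_square is a made-up name, added only to show how to move the squaring into the workers too.

```python
from math import sqrt
from joblib import Parallel, delayed

# delayed(f)(args) does not call f; it only records what to call.
func, args, kwargs = delayed(sqrt)(16)

# The example from the text: i ** 2 is evaluated here, in the main process,
# while the generator builds each task. Only sqrt runs in the workers.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))


# To have the workers do the squaring as well, put both steps in one function.
def root_of_square(i):
    return sqrt(i ** 2)


results_in_workers = Parallel(n_jobs=2)(
    delayed(root_of_square)(i) for i in range(10)
)
```

Both calls return the same list of floats, in input order.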




https://joblib.readthedocs.io/en/latest/