Python usage notes/joblib
Joblib is a way to
- serialize jobs,
- execute them on demand,
- execute them in parallel, and
- offer memoization to avoid double work.
- There's also some file-memmapping functionality (mostly to share read-only data)
- There's also some shared-memory functionality
It is also sometimes used purely as a serialization format.
It has a few options that let you bolt it onto existing code, though it also somewhat has its own way of doing things.
It's numpy-aware, and should deal okayish with large arrays, compressing them where that's easy.
Memoization
joblib.Memory is disk-backed memoization, which you
- could decorate functions with (as sketched after this list),
- could wrap around specific calls more explicitly, to do the occasional checkpoint,
- could fold into every Parallel call.
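A minimal sketch of the decorator style (the cache directory is just an example path):

 from joblib import Memory

 memory = Memory('/tmp/joblib_cache', verbose=0)   # any writable directory works

 @memory.cache
 def expensive(x):
     print('actually computing', x)
     return x ** 2

 expensive(3)   # computes, and stores the result on disk
 expensive(3)   # returns the cached result without calling the function again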
Parallel execution
joblib.Parallel can use one of a few backends (which you can also pick explicitly, as sketched after this list):
- threading
  - lower overhead, but not always faster (consider GIL stuff)
- multiprocessing
  - more overhead, but can be safer.
- loky[1] (the default)
  - like multiprocessing, with a few extra nice details
  - (also capable of threading)
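For example, roughly (prefer= and backend= are the relevant Parallel parameters; delayed() is explained below):

 from math import sqrt
 from joblib import Parallel, delayed

 # default backend (loky, process-based)
 Parallel(n_jobs=4)(delayed(sqrt)(i) for i in range(10))

 # ask for threads instead, e.g. when the work releases the GIL
 Parallel(n_jobs=4, prefer='threads')(delayed(sqrt)(i) for i in range(10))

 # or name a backend directly
 Parallel(n_jobs=4, backend='multiprocessing')(delayed(sqrt)(i) for i in range(10))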
joblib.dump() and joblib.load() help serialize numpy data (and in general handle more types than plain pickle does).
This is however not a portable long-term storage format, because the underlying pickling (cloudpickle, in some code paths) is only guaranteed to work within the exact same version of Python, not between versions. (verify)
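For example (filenames here are placeholders), roughly:

 import numpy as np
 import joblib

 arr = np.arange(1_000_000, dtype=np.float64)

 joblib.dump(arr, 'arr.joblib', compress=3)         # compress= is optional
 back = joblib.load('arr.joblib')

 # the memmapping side: an uncompressed dump can be loaded as a read-only memory map
 joblib.dump(arr, 'arr_raw.joblib')
 view = joblib.load('arr_raw.joblib', mmap_mode='r')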
delayed() is basically a cleanish way to pass in the function and its arguments to Parallel,
without accidentally calling it as a function and doing that work in the main interpreter.
For example, the snippet from [2] parallelizes
[sqrt(i ** 2) for i in range(10)]
which works out as something like
Parallel( n_jobs=2 )( delayed(sqrt)(i ** 2) for i in range(10) )
(Note that i ** 2 is still computed in the main process, because the arguments are evaluated as the generator is consumed, before being handed to the workers.(verify))
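Put together into a minimal runnable version:

 from math import sqrt
 from joblib import Parallel, delayed

 sequential = [sqrt(i ** 2) for i in range(10)]
 parallel   = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

 assert sequential == parallel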