Python usage notes/joblib

From Helpful
Revision as of 18:29, 21 April 2023 by Helpful (talk | contribs)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Joblib is a way to

  • serialize jobs,
  • execute them on demand,
  • execute them in parallel, and
  • offer memoization to avoid double work.
  • There's also some file-memmapping functionality (mostly to share read-only data)
  • There's also some shared-memory functionality


It has its own way of doing things, but also has a few options that let you bolt it onto existing code.

It's numpy-aware - and should deal okayish with large arrays, compressing them where that's easy.



Memoization

joblib.Memory is disk-backed memoization, which you

  • could decorate functions with
  • could apply more explicitly, wrapping a call yourself to do the occasional checkpoint
  • could wrap around the function you hand to every Parallel call
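As a minimal sketch of the first two uses (decorating, and wrapping explicitly), assuming a throwaway cache directory; the function names and the calls list are made up for illustration:

```python
import tempfile
from joblib import Memory

# Hypothetical cache location; any writable directory would do.
cache_dir = tempfile.mkdtemp()
memory = Memory(cache_dir, verbose=0)

calls = []  # records how often the real work runs

@memory.cache
def slow_square(x):
    calls.append(x)
    return x * x

first = slow_square(4)   # computed, result stored on disk
second = slow_square(4)  # same argument: read back from the cache

# The more explicit form: wrap an existing function where you want a checkpoint.
def add_one(x):
    return x + 1

checkpointed_add_one = memory.cache(add_one)
third = checkpointed_add_one(41)
```

Because the cache lives on disk, results also survive between runs, as long as the cache directory, the function's code, and the arguments stay the same.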


Parallel execution

joblib.Parallel can run jobs in

  • threads: lower overhead, but not always faster (consider GIL stuff)
  • processes: more overhead, but can be safer.

The process-based mode is like multiprocessing, with a few extra nice details.
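A small sketch of picking between threads and processes, assuming a joblib version recent enough to have the prefer= hint on Parallel (it is a soft hint, not a hard requirement on the backend):

```python
from math import sqrt
from joblib import Parallel, delayed

# Threads: cheap to start, but pure-Python work still contends for the GIL.
via_threads = Parallel(n_jobs=2, prefer="threads")(
    delayed(sqrt)(i) for i in range(5)
)

# Processes: more startup and pickling overhead, but no shared GIL.
via_processes = Parallel(n_jobs=2, prefer="processes")(
    delayed(sqrt)(i) for i in range(5)
)
```

Both calls return a list of results in input order, so which one is faster is mostly a question of how much work each job does relative to that overhead.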


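joblib.dump() and joblib.load() help serialize numpy data efficiently, and otherwise behave much like pickle for other Python objects.

This is however not a portable format: it builds on pickle, and joblib's documentation says compatibility of joblib pickles across Python versions is not fully supported. (cloudpickle, which joblib uses to ship functions to worker processes, is similarly only meant to work between identical Python versions.)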
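A minimal round-trip sketch, assuming numpy is installed; the file name and the dictionary contents are made up for illustration:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load

# Hypothetical payload: a numpy array plus some ordinary Python data.
data = {"weights": np.arange(6, dtype=np.float64).reshape(2, 3), "label": "demo"}

path = os.path.join(tempfile.mkdtemp(), "data.joblib")
dump(data, path, compress=3)  # compress is optional; 0 means no compression
restored = load(path)
```

As with pickle, load() can execute arbitrary code, so only load files you trust.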



delayed() is basically a cleanish way to pass in the function and its arguments to Parallel, without accidentally calling it as a function and doing that work in the main interpreter.

For example, the example from [2] is trying to parallelize

[sqrt(i ** 2)  for i in range(10)]

which works out as something like

Parallel( n_jobs=2 )( delayed(sqrt)(i ** 2)  for i in range(10) )

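(Note that i**2 is still computed in the main process: the generator evaluates the arguments before delayed(sqrt) captures them, so only the sqrt call itself is handed to the workers.)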
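To make that concrete, a runnable sketch of the same example; unpacking the tuple relies on delayed() returning a plain (function, args, kwargs) tuple, which is how current joblib versions behave:

```python
from math import sqrt
from joblib import Parallel, delayed

# delayed(sqrt)(...) does not call sqrt; it just bundles up the call.
# The argument 3 ** 2 has already been evaluated to 9 at this point.
func, args, kwargs = delayed(sqrt)(3 ** 2)

# The example from the text: each sqrt call runs in one of two workers.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```

If the squaring should also happen in the workers, it has to be part of the function that is passed to delayed().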




https://joblib.readthedocs.io/en/latest/