Python notes - threads/threading

Intro

Python's threads are OS-level threads (pthreads, windows threads, or such), so they rely on OS scheduling.

Python threads will not spread Python code over multiple cores, because of the GIL: in CPython only one thread executes Python bytecode at a time. C extensions can do better for their private work, but for parallel processing you're often better off using multiprocessing (which tends to be more portable and efficient than multithreading for using cores anyway).


The standard library itself is mostly, but not yet fully, thread-safe. The exceptions are mostly in the places you'd expect them, like IO. Python won't crash, but there are some things that aren't quite as atomic as you might expect.


There are two modules: the simple thread (renamed _thread in py3), which provides basic thread objects, and the higher-level threading, which builds on it and provides mutexes, semaphores and such.


Because in practice (and I paraphrase) 11 out of 10 people don't manage to implement a threaded app correctly, you may want to take a look at stackless and/or frameworks like Kamaelia to save you headaches in the long run.

Or just be conservative about what you share between threads, and how you lock that. Python makes that a little easier than lower-level languages, but there's still plenty of room for mistakes.

Or just use it to separate things that won't block each other (...but in this case, also look at event driven systems - they can be more efficient depending on what you're doing).


thread module

The thread module provides threads, mutexes, and a few other things.

You can use it to fire off a function in a thread, but since you cannot wait for such threads, and main thread termination might monkeywrench things(verify), you probably want the threading module instead.


threading module

The threading module provides some more advanced locking mechanisms, and creates objects that represent threads.

Brief way

When you just want to create a thread for a specific function, this saves you half a dozen boilerplatish lines:

thr1 = threading.Thread(target=func)  # you can also hand along args, kwargs, and thread name
thr1.start()
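
A slightly fuller sketch of the same thing, handing along args, kwargs and a thread name. The function greet and the results list are made up for illustration; only threading.Thread and its target, args, kwargs and name parameters come from the standard library.

```python
import threading

results = []  # hypothetical place for the thread to leave its output

def greet(greeting, who="world"):
    # Runs inside the new thread; current_thread() is the Thread object running us.
    name = threading.current_thread().name
    results.append("%s, %s (from %s)" % (greeting, who, name))

thr1 = threading.Thread(target=greet, args=("hello",),
                        kwargs={"who": "threads"}, name="greeter")
thr1.start()
thr1.join()   # wait for it to finish before looking at results
print(results)
```

Note that args has to be a tuple, hence the trailing comma for a single argument.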

Verbose way

You can also subclass Thread and override run(). Since this object merely contains the thread, and the object itself stays around after the thread has terminated, you can retrieve data the thread stuck on it while it was running.

When you've got a bunch of bookkeeping to do, this can be convenient. In other cases it just adds lots of lines with no point.

import time
from threading import Thread

class MyThread(Thread):
    ''' An object that contains, and effectively is, a thread '''
    def __init__(self):
        Thread.__init__(self)

    def run(self):  # the function that indirectly gets run when you start()
        self.stuff = time.time()


thr1 = MyThread()  # create the thread object
thr1.start()

thr1.join()        # wait for it to finish

print(thr1.stuff)  # retrieve something after the thread is done

local()

local() creates an object with data guaranteed to be local to the thread.

More precisely, data stored in it by one thread will never be seen by another thread that accesses the same local object.


This is great for temporary bookkeeping.

It also makes it easier to write thread functions you can reuse for concurrent threads, without worrying about collisions, mixups or races from non-local variables.


For example: (and making the point that the main interpreter is considered its own thread)

import threading, time
i = 0
perthread = threading.local()
perthread.num = 42   # set from the main thread, so only visible there

def f():
    global i
    i += 1
    perthread.num = i
    time.sleep(1)
    print(perthread.num)

# create the thread objects
thread1 = threading.Thread(target=f)
thread2 = threading.Thread(target=f)

# start the threads, wait for them to finish
thread1.start(); thread2.start()
thread1.join(); thread2.join()

print(perthread.num)

There are three different states in perthread: one for the main thread, one for the first fired thread, and one for the second fired thread.

This will likely output 1, 2 and 42 (or possibly 2, 1 and 42, depending on which thread was scheduled first) because perthread was accessed from the two new threads and the main thread, respectively (...note that its value is set from a non-locked global (i), so you could theoretically get racing and 1,1,42 as the result(verify)).
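
If that theoretical race on i bothers you, a lock around the increment removes it. A minimal sketch along the lines of the example above; the names i_lock and seen are additions for illustration, and the sleep is dropped because it isn't needed to make the point.

```python
import threading

i = 0
i_lock = threading.Lock()       # assumption: one lock guarding the global counter
perthread = threading.local()
perthread.num = 42              # set in the main thread only
seen = []                       # collects what each worker saw, for checking afterwards

def f():
    global i
    with i_lock:                # increment and copy as one step, so no two threads get the same value
        i += 1
        perthread.num = i
    seen.append(perthread.num)  # each thread reads back its own value

threads = [threading.Thread(target=f) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen))   # the two workers' values
print(perthread.num)  # the main thread's own value is untouched
```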

simple-ish pooling

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

In py3 you get help in the form of ThreadPoolExecutor.
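
A minimal sketch of that, assuming a made-up job function called work. ThreadPoolExecutor lives in concurrent.futures; max_workers and map() are part of its documented interface.

```python
from concurrent.futures import ThreadPoolExecutor

def work(job):
    # hypothetical job function; stands in for something IO-bound
    return job * job

jobs = range(10)

# The with-block waits for all submitted work to finish before it exits.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(work, jobs))   # results come back in input order

print(results)
```

If a job raises, map() re-raises that exception when you reach its result, so failures are harder to miss than with hand-rolled threads.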


In py2, things are a little more manual.

A decent way is to start as many threads as you want, and have each source jobs safely from the same place, probably via a thread-safe, multi-producer, multi-consumer queue like queue.Queue (called Queue.Queue in py2).

Note that if you're not careful about exceptions in the threads, individual ones will stop: an uncaught exception ends only the thread it was raised in, so you quietly end up with fewer workers.
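
A rough sketch of such a queue-fed pool. The process function, the None sentinel convention and the results queue are choices made for this illustration, not the only way to do it.

```python
import threading
try:
    import queue            # py3
except ImportError:
    import Queue as queue   # py2 name for the same module

def process(job):
    # hypothetical job function; fails on one input to show the exception handling
    if job == 3:
        raise ValueError("bad job %r" % job)
    return job * 2

job_queue = queue.Queue()
results = queue.Queue()     # (job, outcome) pairs, filled by the workers

def worker():
    while True:
        job = job_queue.get()
        if job is None:          # sentinel: no more work for this worker
            break
        try:
            results.put((job, process(job)))
        except Exception as e:   # catch it, or this worker silently stops taking jobs
            results.put((job, e))

workers = [threading.Thread(target=worker) for _ in range(5)]
for w in workers:
    w.start()
for job in range(10):
    job_queue.put(job)
for _ in workers:
    job_queue.put(None)          # one sentinel per worker
for w in workers:
    w.join()

# All workers are done, so qsize() is exact at this point.
outcomes = dict(results.get() for _ in range(results.qsize()))
print(outcomes)
```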



If you want to start a thread for each job, then there's a quick and dirty solution in something like:

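import threading
import time

jobs = list(range(50))  # these represent jobs to be started. You'd use something real
fired = []
target_threads = 5

# some_function is a stand-in for whatever handles one job
while len(jobs) > 0:  # while there are jobs to be worked on
    if threading.active_count() < target_threads + 1:  # fewer threads working than we want (+1 is the main thread)
        # hand over the function and its argument separately; target=some_function(job)
        # would run the job right here in the main thread instead
        t = threading.Thread(target=some_function, args=(jobs.pop(),))
        t.start()
        fired.append(t)
    else:  # enough threads working
        time.sleep(0.2)

# Make sure all threads are done before we exit.
for th in fired:
    th.join()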

daemon threads

When a python script exits, there's a hook (threading._shutdown(verify)) that effectively join()s all threads.

Threads started as daemon threads will be ignored by that code - and effectively just be interrupted by the process ending.

There is no special status to these threads, it's just some python-specific bookkeeping: whether an attribute was set before start()ing it.


In more practical terms, non-daemon threads are the reason scripts with threads will often seem to hang when exiting, and daemon threads are a way to avoid that.

...but they are only a good idea when it's safe to stop the thread at any point.

Say, a thread that is just feeding things to the main thread is implicitly uninteresting once the main thread decides it is exiting (because it's past doing jobs).

...whereas, say, an IO thread that may be doing writes should not be interrupted, so you would probably want to keep it non-daemon, tell it to stop between its jobs, and join() it from the main thread (implicitly from the hook, or explicitly, just to signal to maintainers that yes, you thought about it).


Since py3.3 you can hand
daemon=True
to Thread's constructor. Before then you'd probably either set
thr.daemon = True
or call
thr.setDaemon(True)
- either of which must be done before thr.start() gets called (because this management is done at thread init time; trying it on a started thread raises a RuntimeError).
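
A small sketch of the attribute form, including what happens when you set it too late. The background function is a made-up stand-in for a helper loop that is safe to cut off at any point.

```python
import threading
import time

def background():
    # hypothetical helper loop that is safe to cut off at any point
    while True:
        time.sleep(0.1)

thr = threading.Thread(target=background)
thr.daemon = True          # has to happen before start()
thr.start()

changed_late = True
try:
    thr.daemon = False     # too late: the thread is already running
except RuntimeError as e:
    changed_late = False
    print("cannot change daemon status after start(): %s" % e)

print(thr.daemon)
# The script ends here without waiting for background();
# as a non-daemon thread it would keep the process alive forever.
```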

Timely thread cleanup, getting Ctrl-C to work, and other subtleties

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


On join()

Join waits until a thread is done.

This is a locking mechanism, and does not imply any kind of cleanup. (If you want cleanup, your code should explicitly tell the thread that it's shutdown time, and the thread should do that cleanup before it ends.)

It's good common practice to wait on all child threads - but often not really necessary.

The cases where it makes sense are those where the child thread must do cleanup to not break something, and should not be interrupted by the main thread exiting. Say, if you have an IO thread writing files, you probably want to not break in the middle of a file write.


Note that unless you mark threads as daemon threads, the interpreter will wait on all of them before exiting.

Which, if you thought about none of this, means it'll not exit very easily.
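
One detail that is easy to miss: join() with a timeout always returns None, so you have to ask is_alive() to tell "it finished" from "the timeout ran out". A short sketch, with slow as a made-up stand-in for real work:

```python
import threading
import time

def slow():
    time.sleep(0.5)   # stands in for real work

thr = threading.Thread(target=slow)
thr.start()

thr.join(0.05)                  # gives up waiting after about 0.05s
still_running = thr.is_alive()  # join() returns None either way, so check explicitly
print("after short join, alive: %s" % still_running)

thr.join()                      # no timeout: block until it really finishes
finished = not thr.is_alive()
print("after full join, finished: %s" % finished)
```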


Dealing with Ctrl-C

Ctrl-C will arrive in only one thread.


Practically, you probably prefer to react to it in the main thread, mainly for ease of management, to have one clear place that tells others to stop.

If the signal module is available, then the signal will go to the main thread - but you can't count on that on every system. (details?(verify))

If you don't want to rely on that, the more robust variant is to ensure it goes to the main thread yourself: try/except KeyboardInterrupt in all threads, and call thread.interrupt_main() in response (that is _thread.interrupt_main() in py3).


(You may want to deal with other signals as well. In part because you may care, in part because some Python implementations will try to get the OS scheduler to switch faster to get the signal handled sooner - which means that scheduling is less efficient until you actually do)


The main thread needs to be scheduled at all to catch the signal, and for some code, the OS may not have any reason to do so - say, if it's indefinitely join()ed on a child thread, or waiting on a lock. As such, try to do those things with a timeout (in the case of a lock, this may require switching to a type of lock that supports checks / timeouts), or do polling.


If you want to be able to shut down threads cleanly in terms of whatever they're doing, you'll want some way of letting the main thread tell all others to stop their work soon - quite possibly just a shared variable (in the example below a global) that all threads can respond to soon enough.


Example dealing with a bunch of that:

import threading
import time
try:
    import _thread as thread  # py3 name
except ImportError:
    import thread             # py2 name

stop_now = False

# Your thread function might look something like:
def threadfunc():
    try:
        while not stop_now:
            print("child thread doing stuff")
            time.sleep(1)
    except KeyboardInterrupt:  # for cases where we see it, send it to the main thread instead
        print("\nCtrl-C signal arrived in thread")
        thread.interrupt_main()


# Now, assuming we've started threads we can join on, e.g.
fired_threads = []
f = threading.Thread(target=threadfunc)
f.start()
fired_threads.append(f)

# then your main loop can e.g. do
while len(fired_threads) > 0:
    try:
        # wait for each to finish. The main thread will be doing just this.
        for th in list(fired_threads):  # iterate over a copy, since we remove from the list below
            print("parent thread watching")
            th.join(0.2)  # timeout quickly - this main thread still gets scheduled relatively rarely
            if not th.is_alive():
                #print("Thread '%s' is done" % th.name)  # you may want to name your threads when you make them.
                fired_threads.remove(th)
    except KeyboardInterrupt:
        print("\nCtrl-C signal arrived in main thread, asking threads to stop")
        # some mechanism to get worker threads to stop,
        # such as a global they listen to within reasonable time
        stop_now = True
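
A variation on the example above that swaps the stop_now global for a threading.Event. That is a substitution made here, not something the example requires: the upside is that the worker can wait on the event instead of sleeping, so it reacts to a stop request immediately. The ticks list and the "stop after three ticks" condition are made up so that the sketch ends by itself.

```python
import threading

stop_event = threading.Event()   # replaces the stop_now global
ticks = []                       # hypothetical record of work done

def threadfunc():
    while not stop_event.is_set():
        ticks.append(1)          # "doing stuff"
        stop_event.wait(0.05)    # like sleep(0.05), but returns early once the event is set

worker = threading.Thread(target=threadfunc)
worker.start()

try:
    while worker.is_alive():
        worker.join(0.2)         # short timeout, so the main thread can still see Ctrl-C
        if len(ticks) >= 3:      # stand-in for "main thread decides it is time to stop"
            stop_event.set()
except KeyboardInterrupt:
    stop_event.set()             # ask the worker to stop, then wait for it
    worker.join()

print("worker stopped after %d ticks" % len(ticks))
```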

Locks

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The following are factories that return new locks.

threading.Lock()

  • once acquire()d, any thread doing another acquire() will block until the lock is released
  • any thread may release it

threading.RLock()

  • must be release()d by the thread that acquire()d it.
  • a thread may lock it multiple times (without blocking); acts semaphore-like in that multiple acquires should be followed by just as many release()s
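
A minimal sketch of both, with made-up names (counter, add_many, outer, inner). Using a lock in a with statement does the acquire() and release() for you, also when an exception is raised in between.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with counter_lock:      # acquire() on entry, release() on exit
            counter += 1        # read-modify-write, so not safe without the lock

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)

# An RLock may be re-acquired by the thread that already holds it;
# doing the same with a plain Lock would block forever.
rlock = threading.RLock()

def inner():
    with rlock:                 # second acquire, same thread
        return "ok"

def outer():
    with rlock:                 # first acquire
        return inner()

print(outer())
```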

threading.Semaphore([value])

threading.BoundedSemaphore([value])
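
A semaphore is a counter: acquire() decrements it and blocks while it is zero, release() increments it. The bounded variant additionally raises ValueError when released more often than it was acquired. A sketch that uses one to cap how many threads do something at the same time; the names slots, limited, running and peak are made up for the illustration.

```python
import threading
import time

max_concurrent = 2
slots = threading.BoundedSemaphore(max_concurrent)  # counter starts at 2
state_lock = threading.Lock()
running = 0     # how many threads are inside the guarded section right now
peak = 0        # the highest value 'running' ever reached

def limited():
    global running, peak
    with slots:                 # blocks while max_concurrent threads are already inside
        with state_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)        # stands in for the work you want to limit
        with state_lock:
            running -= 1

threads = [threading.Thread(target=limited) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency: %d" % peak)

# Releasing more often than acquired is an error for the bounded variant.
try:
    slots.release()
    over_release_caught = False
except ValueError:
    over_release_caught = True
print(over_release_caught)
```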

On python threading, and the GIL

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(this is largely information from David Beazley's interesting GIL talk; see e.g. [1])