Python usage notes/Subprocess


subprocess module

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

tl;dr:

  • available in ≥py2.4
aims to replace most earlier things (os.popen, os.system, os.spawn, or the commands or popen2 modules), and is more predictable cross-platform than some of those.
  • You usually want to use the subprocess.Popen class
(It also has subprocess.call(), which is slightly shorter when you can wait for it to finish. It just creates a Popen object and wait()s on it)
  • shell=
False: pass a list of strings (handled more execv-style); a little leaner because it avoids the extra shell process, mostly sidesteps shell escaping, and is more secure against injection attacks
True: hand in a single string, to be parsed by the shell it is run in. Can be more predictable in terms of environment, or just lazier in general


  • if you want stdin, stdout, and/or stderr to go to you rather than the terminal, you need to specify you want that
then you can read() and write(), and/or use the gathered results
> /dev/null
can be imitated by having
DEVNULL = open(os.devnull, 'w')
and handing that in (though if you use shell=True you might as well do it there)
  • if you can wait until it's done, use communicate()
handles stdin, stdout, and stderr in a single call
most other options (waiting/polling, reading in chunks rather than a single blob) are also valid, but more work
  • if you need to interact with it, even just read output, then blocking calls are an issue. Read up.
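To imitate `> /dev/null` as mentioned above, a minimal sketch (using the python interpreter itself as a stand-in child process, so the example is portable):

```python
import os
import subprocess
import sys

# open a sink for unwanted output; os.devnull is '/dev/null' on unix, 'nul' on windows
devnull = open(os.devnull, 'w')
# run a child that prints something, and discard its stdout and stderr
ret = subprocess.call([sys.executable, '-c', 'print("noise")'],
                      stdout=devnull, stderr=devnull)
devnull.close()
```

ret is the child's return code, 0 here.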


command in single string or array, and shell=True/False

single string and shell=True

You are handing this string to the shell to parse and execute.
Can e.g. include multiple commands, pipes, and other shell features.
Gives the same escaping trouble as you would typing commands into a normal shell.
Be careful to sanitize strings: someone could exploit the following example with, say, name = '"; cat /etc/shadow; echo "'
Lets you write:
p = subprocess.Popen('ps ax | grep "%s"'%name, stdout=subprocess.PIPE, shell=True)
output,_ = p.communicate()


Array of strings and shell=False

Safer, but often a little more code.
Some things will notice they're not running in a shell and act differently.
(In a few cases you can only get sensible behaviour with the variant above)
The previous example in this style would be something like:
ps   = subprocess.Popen(['ps','ax'],    shell=False, stdout=subprocess.PIPE)
grep = subprocess.Popen(['grep', name], shell=False, stdin=ps.stdout, stdout=subprocess.PIPE)
grep_output,_ = grep.communicate()


The other two combinations don't make sense

A single string with shell=False is equivalent to placing that string into a single-item list - it won't work unless it's a single command with no arguments
A sequence with shell=True seems to use args[0] and ignore the rest.
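If you have a command as a single string but want the shell=False style, the standard library's shlex.split() does shell-like tokenizing for you (a sketch; note it follows POSIX quoting rules, so it is not a faithful model of cmd.exe on Windows):

```python
import shlex

# turn a shell-style command line into an argument list suitable for shell=False
args = shlex.split('grep -i "two words" file.txt')
# quoted arguments stay together as one list item
```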


Popen constructor arguments

  • args, which can be either a single string or a sequence of argument strings, see above
  • shell=False (the default; if True, execute through the shell, /bin/sh). Note that
    • many programs don't need shell=True, but it may be simpler for you when you use shell features like wildcards, pipes, here documents and whatnot.
    • shell=True means characters special to the shell are interpreted (it may be a pain to escape them)


  • stdin=None, stdout=None, stderr=None, each of which can be
    • None: no redirection, usually meaning they stay tied to the terminal/shell that python was started from
    • subprocess.PIPE - meaning you get an object you can read() from (for stdout, stderr) or write() to (for stdin)
    • a file object, or file descriptor (integer)
    • also, you have the option of merging stderr and stdout (specify stderr=subprocess.STDOUT)
  • bufsize=0 (applies to stdin, stdout, and stderr (verify) if subprocess fdopen()s them)
    • 0 means unbuffered (default)
    • 1 means line buffering
    • ≥2 means a buffer of (approximately) that size
    • -1 / negative values imply system default (which often means fully buffered)
  • env=None
    • None (the default) means 'inherit from the calling process'
    • You can specify your own dict, e.g. copy os.environ and add your own
  • cwd=None
    • if not None, is taken as a path to change to. Useful for cases where files must be present in the current directory.
    • Does not help in finding the executable
    • does not affect the running python program's cwd (verify)
  • executable=None
    • When shell=True, you can specify a shell other than the default (/bin/sh on unix, the value of COMSPEC on windows)
    • When shell=False, you can specify the real executable here and use args[0] for a display name.
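For example, a sketch of handing in a copied-and-extended environment (MY_SETTING is a made-up variable name, and the child is just python echoing it back):

```python
import os
import subprocess
import sys

env = dict(os.environ)           # copy, so we don't mutate the parent's environment
env['MY_SETTING'] = 'example'    # hypothetical variable, for illustration

p = subprocess.Popen([sys.executable, '-c',
                      'import os; print(os.environ["MY_SETTING"])'],
                     stdout=subprocess.PIPE, env=env)
out, _ = p.communicate()
# out contains b'example' plus a trailing newline
```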


  • universal_newlines=False
    • If True, '\n', '\r', and '\r\n' in stdout and stderr arrive in python as '\n'.
    • (Depends on code that may, in fairly rare cases, not be compiled into python)
  • preexec_fn=None
    • call a python callable in subprocess, before the external call. Unix only.
  • close_fds=False
    • Closes all file descriptors other than stdin/stdout/stderr (0/1/2) in the child before executing
  • startupinfo and creationflags
    • Windows-only

Popen object members

  • stdin:
if you constructed Popen with stdin=PIPE: file object
if you didn't (default): None
(note: if you need only a single interaction with the process, then using communicate() is often simpler)
  • stdout,stderr (note: can be ignored if you use communicate())
file objects if you constructed Popen with stdout=PIPE / stderr=PIPE
None if you didn't (default)



related to completion

  • communicate(input=None)
...sends input string (if specified), reads stdout and stderr into memory (as requested), returns those two as strings
see below for more detail
  • poll() for child process completion. Handy when you want to watch several sub-processes, or do stuff asynchronously.
returns process return code if it's done,
returns None if it's not
  • wait(timeout=None) for child process completion.
once finished, returns the process's return code.
if you specified a timeout and it expired, raises TimeoutExpired (the timeout parameter exists since py3.3)
communicate() is preferred due to the deadlock issue (verify)


  • pid: child process's Process ID
  • returncode:
None - before the child has terminated
integer return code - after the child has terminated
note that 127 often means shell=True and the shell couldn't find the executable (see man sh, man bash); if shell=False, it will have raised an OSError instead
On unix, negative values signal termination by signal (-abs(signalnum))
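A minimal polling-loop sketch (using a short python sleep as the child):

```python
import subprocess
import sys
import time

p = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(0.2)'])
while p.poll() is None:   # None means it hasn't finished yet
    time.sleep(0.05)      # in real code you would do other work here
rc = p.returncode         # now set; 0 for a clean exit
```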

Where the streams go

By default, the subprocess object's stdout and stderr streams are not touched, typically meaning they end up in the underlying shell.


If you specify subprocess.PIPE

then you get file handles in the calling python process corresponding to stderr and stdout on the relevant subprocess object.
communicate() will read from these until process is complete, and return one string each.
Convenient for short-running programs.
Alternatively you can read(), readline(), etc. on them as you wish
...but note this (unlike communicate) has some potential deadlock situations.


This deadlock is not unique to python, or unix (Python's read and write wrap libc's pretty much as-is).

It's actually a sort-of-textbook example, for cases where you have more than one limited-size pipe, combined with blocking reads/writes.

Consider: one side is doing a blocking write on one pipe -- which only actually blocks when that pipe's buffer is currently full (waiting for the other end to start emptying it) -- while the other side is doing a blocking read on the other, empty pipe (waiting for some data to appear). Neither can proceed.


You can sometimes get away with not thinking about this at all, because this deadlock tends not to happen when:

  • the subprocess tends to spam both streams with roughly equal amounts of short messages,
  • when it doesn't write to one of them (will never block writes on that)
  • if there is more buffering than the subprocess typically outputs to either
  • if the subprocess actually acts in a prompt / request-response style (because that means both sides will tend to use one stream)


Use of wait() apparently has a similar issue.(verify)



In general, your choices:

  • For a non-interactive program, use
    communicate()
    , which just gives you the full outputs as strings, and is written to be free from this specific deadlock trouble.
  • If you want to react as things happen, or know the output may take too much RAM, then you must stream instead
    • Consider merging the two streams using
      stderr=subprocess.STDOUT, stdout=subprocess.PIPE
      (...though how cleanly they mix depends on how exactly the underlying process flushes. It's fine in most situations, but separating them is cleaner)
    • If you must have them separately, options include:
      • use threads for each pipe
      • you can use select(), though the entirely correct logic is a little long (also note select() on pipes is not cross-platform; on Windows it only works on sockets)
      • use O_NONBLOCK, though that changes your logic and also makes it a little more complex


For example, to collect both independently with threads, building on a base like:

import threading

def readerthread(fh, buffer):
    # drain one stream completely; append the single blob when done
    buffer.append( fh.read() )

# p is assumed to be a Popen constructed with stdout=PIPE, stderr=PIPE
out_ary = []
out_thread = threading.Thread(target=readerthread, args=(p.stdout, out_ary))
out_thread.start()

err_ary = []
err_thread = threading.Thread(target=readerthread, args=(p.stderr, err_ary))
err_thread.start()

out_thread.join()
err_thread.join()
p.wait()


Note that when you use readline() in the above, python by default adheres to POSIX newlines, i.e. \n, and doesn't consider \r to be a newline. This means that in some cases you want to look at universal_newlines=True on the subprocess call (it looks for the others and translates them for you - see PEP 278) for it to work as you expect.
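For the select() option mentioned above, a unix-only sketch (select() on pipes does not work on Windows); os.read() is used because it returns whatever is currently available rather than blocking for a full buffer:

```python
import os
import select
import subprocess
import sys

p = subprocess.Popen([sys.executable, '-c',
                      'import sys; sys.stdout.write("out"); sys.stderr.write("err")'],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)

chunks = {p.stdout: [], p.stderr: []}
open_fhs = [p.stdout, p.stderr]
while open_fhs:
    readable, _, _ = select.select(open_fhs, [], [])  # wait until something has data
    for fh in readable:
        data = os.read(fh.fileno(), 4096)  # returns what's there; won't block for more
        if data:
            chunks[fh].append(data)
        else:                              # empty read means EOF on this stream
            open_fhs.remove(fh)

out = b''.join(chunks[p.stdout])
err = b''.join(chunks[p.stderr])
p.wait()
```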

On stream buffering

Keep in mind that (and most of this is not python-specific)

  • stdin is buffered
  • stdout is typically buffered, or line buffered on shells
  • stderr may not be buffered
buffering at all means there is no true order to what comes in on these two streams (unless you remove all buffering (usually hinders performance) and the program isn't threaded)
  • ...this applies within each process. You can often not control how a program buffers or flushes.
  • a pipe also represents a buffer


bufsize: see its mention in the argument list above. This is basically the buffering applied on python's side if and when it uses fdopen() on the underlying file descriptor. (verify)


It seems that iterating over a file object for its lines, i.e.

for line in fileobject:

adds its own buffering within the iterator(verify), so if you want more realtime feedback, either add your own while loop around readline - or get the same behaviour via:

for line in iter(fob.readline, ''): #note: on py3 that should typically be b''
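
Putting that together, a sketch that streams lines as they appear (the -u makes the child's output unbuffered; the child is just python printing two lines):

```python
import subprocess
import sys

p = subprocess.Popen([sys.executable, '-u', '-c', 'print("a"); print("b")'],
                     stdout=subprocess.PIPE)
lines = []
for line in iter(p.stdout.readline, b''):  # b'' signals EOF; use '' on py2 / text mode
    lines.append(line.rstrip())            # strip the trailing newline
p.stdout.close()
p.wait()
```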


Usage notes, examples

wait()ing

Handy when you want to block until the subprocess quits.

p = subprocess.Popen("ps ax | grep %s"%name, shell=True, stdout=subprocess.PIPE)
p.wait()
output = p.stdout.read()
p.stdout.close()


communicate()ing

Handy convenience function when you want to block and handle input and output data: communicate() sends in the input data (if any), wait()s for the process to finish, and returns the stdout and stderr contents as two strings. Example:

p = subprocess.Popen("sendmail -t -v", shell=True,
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate( email_message.as_string() )


If you want to watch several sub-processes, you'll be interested in poll() (returns the return code, or None if not yet finished(verify)).


PATH and environment

You can rely on PATH (and other inherited environment(verify)) whether shell=True or False -- unless it is explicitly cleared for some reason (sudo, some embedded interpreters).

On errors



A lot of errors will fall under OSError, in which case you probably want to check the exception's .errno

  • errno==2 (ENOENT, 'file not found'), here referring to the executable
if the executable is there, it may e.g. be that you're handing a list into shell=True
  • errno==11 (EAGAIN, 'Resource temporarily unavailable'), probably on the os.fork() call, usually means the maximum number of child processes has been reached (think ulimit) -- probably because they're not being cleaned up
also note that communicate() does this cleanup for you, so may be convenient
if you need streaming, then it's probably that you're not closing all pipes (also consider stdin, if you open it)
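A sketch of telling these cases apart (the executable name is intentionally nonsense so the Popen fails):

```python
import errno
import subprocess

try:
    subprocess.Popen(['program-that-does-not-exist-hopefully'])
    msg = 'started'
except OSError as e:
    if e.errno == errno.ENOENT:
        msg = 'executable not found'
    else:
        msg = 'other OS error: %s' % e
```

On py3 this arrives as FileNotFoundError, which is a subclass of OSError, so the same check works.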


On signals

Replacing older styles with subprocess

For details, see http://docs.python.org/release/2.5.2/lib/node533.html


Summary of that:

  • Most of the previous styles rely on shell parsing, so the easiest method is to pass in the string as before and set shell=True
    • ...except os.spawn*, it's list-based. If you're using this, you probably want to read up on the details anyway.
    • ...and popen2.popen[234] in cases where you give it a list (it can take a string and sequence and choose what you now handle with shell)
  • redirect as you need to, get the file objects from the Popen object
  • hand along bufsize if you need it
  • You may want to check out differences in whether the call closes open file handles
  • You may want to check the way errors arrive back in python
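For example, an os.popen()-style one-liner and its subprocess equivalent (a sketch; 'echo hi' is just a placeholder command):

```python
import subprocess

# old style:  output = os.popen('echo hi').read()
p = subprocess.Popen('echo hi', shell=True, stdout=subprocess.PIPE)
output = p.communicate()[0]   # the command's stdout, as one blob
```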


Older stuff

Historically, there have been a number of system call methods, mostly:


  • os members
    • os.popen()
    • os.system()
    • os.spawn...
  • commands (a convenience wrapper around os.popen)
  • popen2 (2.4, 2.5; deprecated in 2.6)
    • popen2.popen2()
    • popen2.popen3()
    • popen2.popen4()
    • popen2.Popen3 class
    • popen2.Popen4 class