Mod wsgi notes

From Helpful
Related to web development, lower level hosting, and such: (See also the webdev category)


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Application URL mounting - WSGIScriptAlias and WSGIScriptAliasMatch



WSGIScriptAlias statically maps an absolute vhost path to a WSGI script, for example:

WSGIScriptAlias /myapp /local/path/wsgiscripts/myapp.wsgi

Notes:

  • You may want to avoid a final slash. If you use /myapp/, a request for /myapp won't be a prefix match
    • so yes, this detail only applies to the single case of the implied index page
  • The .wsgi file need not be within the DocumentRoot tree
    • ...putting it (and your supporting modules) elsewhere means you'll never serve them accidentally
  • regardless of where code sits, importing further modules will probably require adding that path to sys.path
  • You can add more than one of these mappings. The first match is used, so ordering matters, and you probably want general/fallback handlers last.
  • If you mount to the root (/), then all requests go via WSGI, including things like favicon.ico and robots.txt.
If you want apache to do static file handling for specific files or directories, it may make sense to Alias them(verify)
  • What WSGIScriptAlias does is roughly the combination of Alias, Options ExecCGI, and SetHandler wsgi-script, and you can configure things that way
(in a few cases that way may be preferable -- including cases where you would want to use AddHandler instead of SetHandler, to mix handlers within a directory)(verify)
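As a sketch of that equivalence (using the same example paths as above, and hedged with the same (verify) as the note it illustrates):

```apache
# Roughly what  WSGIScriptAlias /myapp /local/path/wsgiscripts/myapp.wsgi  amounts to:
Alias /myapp /local/path/wsgiscripts/myapp.wsgi
<Directory /local/path/wsgiscripts>
    Options ExecCGI
    SetHandler wsgi-script
    # ...or, to mix handlers within this directory, instead of SetHandler:
    # AddHandler wsgi-script .wsgi
</Directory>
```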



WSGIScriptAliasMatch

WSGIScriptAliasMatch is a regex-matching variation of WSGIScriptAlias.

It can be useful for controlled mapping to script filenames (though see also the mention of mapping WSGIScriptAlias to a directory, above).


For example, if you want /wsgi-scripts/image/something to go to image.wsgi, /wsgi-scripts/uni/ to go to uni.wsgi, and so on, without doing that all manually, you might use something like:

WSGIScriptAliasMatch ^/wsgi-scripts/([^/]+) /web/wsgi-scripts/$1.wsgi



.wsgi files

That is, the files pointed to by WSGIScriptAlias and friends.


These are just python code files, and putting a .wsgi extension on them (anything that isn't .py works) is purely a convention, so that you won't think of them as regular python modules -- because there are some practical differences.


A file with python code, conventionally containing a callable named application.

May contain anything you choose, really

  • one or two dozen lines that do things like
    • altering sys.path or other environment details to make things work
    • basic configuration for the main dispatched app
    • dispatching
  • hand over control to a framework (such as django).
  • a whole application
    • ...this often isn't very portable or convenient. You can't easily import this as a module.
    • ...and mod_wsgi won't consider these as part of #Code_reloading
    • Most of the limitations of these files seem to be by design - they were intended to be entry points, not workhorses.
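As an illustration of the entry-point style, a minimal hand-written .wsgi file might look like the following (the path and the response body are made up for illustration):

```python
# myapp.wsgi -- a minimal WSGI entry point.
# mod_wsgi looks for a module-level callable named "application".
import sys

# Make the site's own modules importable (see the Paths section below).
# This path is an example, not a convention.
PRIVATE_CODE = '/web/mysite/code'
if PRIVATE_CODE not in sys.path:
    sys.path.append(PRIVATE_CODE)

def application(environ, start_response):
    body = b'Hello from mod_wsgi\n'
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

In practice the body of application would usually be a framework's WSGI app object rather than hand-written code like this.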

Paths

Your DocumentRoot is not in your python path, so attempts to import modules that sit there, or alongside the .wsgi file (which itself can, and arguably should, be elsewhere), will likely fail.

And you generally don't want to put code inside your DocumentRoot, because you risk exposing it.


If you need modules that are specific to the virtualhost's code, you'll need them on the sys.path.


You can do that in code, e.g.

import sys

mypath = '/path/to/mysite'
if mypath not in sys.path:
    sys.path.append(mypath)

but that hardcodes a path, which you may prefer not to do.


You can also do that in config.

One way is to specify python-path on the WSGIDaemonProcess directive, which basically does an addsitedir().

(you can also change some other paths)
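A sketch of that (the group name and path are made up):

```apache
WSGIDaemonProcess example.com processes=2 threads=15 python-path=/web/mysite/code
WSGIProcessGroup example.com
```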


You may like to know that (since mod_wsgi 2.0) WSGIPythonPath acts like site.addsitedir()[1], which means that eggs on the path are also handled.

This also makes it useful to point at a virtualenv's site-packages.

But it's a server-context (not vhost-specific) directive, so it only really works for site-wide things.

Process/daemon control

Embedded versus daemon mode


Embedded mode (the default) means the application is hosted in (an interpreter in) the Apache worker process.

  • slightly faster than daemon mode, probably mostly because it avoids the overhead of communication to the daemon.
  • for best performance under load you may need (apache-wide) MPM tweaking.
  • Code changes are not picked up automatically until an apache restart, so not so handy while debugging
  • The only mode available on windows
  • configuration perhaps a little harder to get right (or at least, easier to do something strange)
  • details vary on the MPM actually used


Daemon mode means mod_wsgi starts up and manages separate processes

  • creates a set of daemon processes (the amount is configurable) which requests are delegated to
  • apache acts like a proxy to the actual process executing WSGI (but apache itself still serves static files where WSGI does not apply)
  • auto-reloads when source date changes (stops and starts the mentioned daemon processes) (verify)
  • tuning can be done for these processes, and separately per daemon process(...group) (which can be a lot handier than doing it apache-wide via MPM settings)
  • may scale better under load (partly because it can be tweaked better, partly because it avoids the memory overhead per apache child)
  • More suitable for shared hosting in that it's easier to separate applications (particularly if they need to be run as different users)


While embedded may be slightly faster, perhaps the largest argument for daemon mode is that you get a pool of a fixed size (and a related ceiling on things like memory use and startup time) - because with embedded you can only do that by a hard limit on amount of apache children at MPM level (which can be a conflicting wish with other things you're hosting).


Daemon mode notes


WSGIDaemonProcess defines a group of multithreaded daemon processes (a worker pool, really), and names that set so we can refer to it.


You then also want to assign each script to such a process group, via a matching name. (When you don't, that script still handles requests in embedded mode.)

The sometimes easier way to do this is to use WSGIProcessGroup, which sets it as a default for all WSGIScriptAlias directives within the vhost.

The more explicit way is to set it as an option on each WSGIScriptAlias directive.

(When you have just one script, there's no difference, but this is e.g. a way to have a distinct production and testing environment)
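A sketch of that explicit per-script variant, using the process-group option on the alias (the group names and paths are made up):

```apache
WSGIDaemonProcess production processes=4 threads=2
WSGIDaemonProcess testing   processes=1 threads=2

WSGIScriptAlias /app  /web/wsgi-scripts/app.wsgi  process-group=production
WSGIScriptAlias /test /web/wsgi-scripts/test.wsgi process-group=testing
```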


For example, to have a vhost-specific daemon (4 processes, each limited to 2 threads), you could add something like:

# Create a daemon group with name "example.com"
WSGIDaemonProcess example.com processes=4 threads=2

# send apps within this vhost to process group called "example.com"
WSGIProcessGroup example.com



WSGIDaemonProcess

  • there are a whole bunch of options relating to environment and resource management [2]
  • When creating a name
    • you can use an explicit name
    • but it's also potentially useful to use one of a few things from apache, e.g.
      • %{SERVER} - server hostname (plus port, only if not 80 or 443(verify)); basically gives you a process set per vhost
      • %{RESOURCE} - %{SERVER} plus SCRIPT_NAME, gives you a process set per script (much like thinking up a unique name would)
      • %{GLOBAL} -
      • %{ENV:variable} - the value of the request's environment variable. May e.g. be useful to combine with rewrite rules
  • further options allow you to control:
    • processes: the amount of daemon processes (defaults to 1) (note: setting this value will set wsgi.multiprocess to True, even when you set it to 1)
    • threads: handler threads in each daemon process (defaults to 15)
    • things like the number of requests to handle before restarting a daemon (default is never), useful to counteract memory leaking
    • who to run as (e.g. user=me group=users), useful for cases where apache runs as root (which is a bad idea, though)
    • display-name: what we show up as in top/ps and such (you can set a string yourself, or %{GROUP}, which becomes (wsgi:groupname))
    • various debug options
    • some path options
    • ...more
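An illustrative (not prescriptive) combination of some of those options -- the name, user/group, paths, and values are all made up:

```apache
WSGIDaemonProcess example.com \
    processes=4 threads=2 \
    user=appuser group=appgroup \
    maximum-requests=1000 \
    display-name=%{GROUP} \
    python-path=/web/mysite/code
```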


Notes:

  • one set of processes per app means that misbehaviour (like responding slowly) affects only things within that process group, so only that app.
  • one reason you may steer multiple apps to the same group of daemons is somewhat reducing memory use (fewer loaded copies of libraries and such). And for simpler apps you may not have to worry about misbehaviour much.
  • ...but you may then find you need subinterpreters to avoid things messing with each other via globals -- and there are reasons not to want subinterpreters,
so the safer option is to create a daemon pool per codebase (where you know and can control the issues)


  • You can define WSGIDaemonProcess within an apache virtualhost, or in global context. The only difference is in how you can use them: in a particular virtualhost you can tell mod_wsgi to use a specific named daemon process, be it one defined globally or in the same virtualhost, but not one defined in another vhost. (And, footnote: you can't reuse names between them.)
  • A daemon group consists of one or more multi-threaded processes (see also [3])
the threads are created by the APR, but (once they are created/used) they act exactly like usual python threads, and you can do all the locking, threading.local() stuff you are used to (there were some past bugs related to this, but as far as I know they're gone now)
keep in mind that some apps may decide that for GIL related reasons multithreading isn't worth it

on sub-interpreters

mod_wsgi will use python subinterpreters to sandbox each apache-config-mapped WSGI entry point.

It will do so on the fly in the process it is mapped to -- daemon processes in daemon mode, apache children in embedded mode.


The upside of subinterpreters is things like

  • lower memory/startup overhead than isolating via a process group per script
  • preventing apps from accidentally seeing each other's data
  • dealing with modules that nastily rely on globals (e.g. matplotlib)


One caveat is that the sandboxing isn't perfect, so it's more about resources and convenience than a security feature. (For example, the environment is shared, things can affect each other via IO, and extensions are shared -- which is like a shared library, but can also be trouble if the library wasn't written to be safely used in this way.)

But probably the main one is that (while cool in pure-python) interplay with the GIL and with C extensions can be very messy -- to the point it can break things and you want to not use subinterpreters for a particular script/process group.


Using fewer subinterpreters than the default setting implies (which is basically 'per apache-configured script'(verify)) can in theory also be useful if you want to force distinct scripts to share globals, or when you want to direct things like WSGIImportScript, WSGIAuthUserScript, WSGIAuthGroupScript or WSGIAccessScript a little more specifically.


In any case, the directive is WSGIApplicationGroup[4], and the main argument is a name.

The same name goes to the same subinterpreter.

You can set any string you figure is unique enough, but it can be useful to use one of:

  • %{GLOBAL} (which is an empty string) - forces things to stay in the main interpreter.
  • %{SERVER} - vhost name. If the port isn't 80 or 443, it will be added.
  • %{RESOURCE} - hostname, port, and SCRIPT_NAME (the part of the URL that apache matched to arrive at this script). This is the default - which is probably as isolated as you want, so you'd only change this to loosen the restriction.
  • %{ENV:variable} - perhaps useful in combination with rewrite rules or such.



Notes:

  • each subinterpreter will live as long as the process (except when interpreter reloading is enabled, in which case one will be recreated if the wsgi script changes).
  • Apps that are meant to be portable ought not to count on sharing via (sub)interpreter globals (For data you'll need an external data store, such as a memcache, shared memory, database, or such. Connection pools will need to be distinct)
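For example, to work around a C extension that misbehaves in subinterpreters (see the caveats above), you might force a vhost's scripts into the main interpreter:

```apache
# Run everything in this context in the main interpreter,
# rather than a per-script subinterpreter.
WSGIApplicationGroup %{GLOBAL}
```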


on threads

On streaming



Debugging

http://modwsgi.readthedocs.io/en/develop/user-guides/debugging-techniques.html

MPM tweaking

In daemon mode

Apache is just proxying to a fixed set of WSGI daemon processes, so the choice of MPM barely matters. (you can leave it up to other conditions, e.g. thread-unsafe PHP extensions)

Daemon mode's memory use tends to be a lot more predictable, since the apache children (typically configured to scale to load) do not run the code at all.

At the apache side, the only thing left is that its number of handlers should probably be slightly higher than the number of handlers in the wsgi daemon processes.


In embedded mode

Each apache child has a python interpreter. It needs time to start up.

The default apache config has a lowish minimum of standby handlers, which means it will only create more handlers when it's under load -- which means you'll get the dynamic app's startup overhead at the worst time.

Setting a larger number of standby handlers keeps more initialized interpreters around all the time.

...at the cost of memory, yes, but memory that your config implied eventually using anyway. Tweaking the amount of extra processes is something you have to do yourself. If you want more predictability, consider daemon mode.

How much this early initialization matters depends on how significant that overhead is (and the reason it's low by default is that for apache's static file handling the overhead is very small).



Lazy loading, preloading


Code reloading


In python in general there is no truly best way of reloading code in-place - largely because of the question "how do you reload all the modules?".


While there are hackish ways that may work well enough for your specific project, the cleanest method is to start from scratch - start a new process (or interpreter, but that is made more complex by subinterpreters).

(in mod_wsgi 2.x versions there was another option, but it was later removed, probably because it was more confusing than useful)


The option is WSGIScriptReloading and it is on by default.

You can disable it for production environments, but know that that means that if you ever do need to reload code, the only option to do so is restarting apache.

Note that it only works in daemon mode (because in embedded mode it is apache that owns and controls the processes, and under some MPMs a process may never be killed at all -- so in embedded mode the surest option to really reload everything is to restart apache)


Keep in mind that it only reacts to changes in the .wsgi file, not anything you've imported. This means you may need to touch that file.

When that wsgi file contains just config and probably the app instance those may get set again -- but not any code that comes from modules you imported.

You could put all your app code in the wsgi file, but it's probably handier to either write some explicit reloading code (e.g. url arguments that let devs control this) or just count on restarting after all.
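If you do lean on that touch-the-file behaviour, a deploy script can trigger the daemon-mode reload just by updating the .wsgi file's modification time; a sketch (the default path is made up):

```python
import os

def touch_wsgi(path='/web/wsgi-scripts/myapp.wsgi'):
    """Bump the file's mtime.

    With daemon mode and script reloading enabled, mod_wsgi notices the
    changed mtime and restarts the daemon process on a subsequent request.
    """
    os.utime(path, None)  # None = set atime/mtime to the current time
```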




Semi-sorted

Truncated or oversized response headers received from daemon process

If you also have regular timeouts blocking access to your app for a long time:

It is moderately likely this is caused by a C extension (like numpy, scipy, skimage) that doesn't understand being run in a sub-interpreter (mod_wsgi uses them for lower resource use, but few C extensions really consider this).

To check this theory, put WSGIApplicationGroup %{GLOBAL} in the applicable vhost config.

If that solves everything, you probably want to leave that in there, and read up on what that means.

PyEval_AcquireThread: non-NULL old thread state

You probably have both mod_wsgi and mod_python loaded, each with their own statically linked copy of the python runtime library (libpython), which can lead to the runtime linker being confused, and to confused python state.

The workaround is to compile mod_wsgi (or either? both?) with the library dynamically linked.

The fix is to stop using mod_python.

Premature end of script headers



You can probably cause this explicitly if you really wanted to, but usually it's a bug.


If it happens only rarely and you use maximum-requests, it may well be related to daemon mode process restarts. When a handler takes longer than apache wants to wait (5 sec? unconfigurable?) it will kill it inelegantly, and cause this error.

(Long-running requests/responses can cause this. If you think this might be because of uploads (which can always come from slow clients), consider using nginx for buffering, particularly request buffering.)


If it happens consistently (...for a given code path...) and the browser sees no response (and will eventually decide to time out), this often means the daemon process (for the request) crashed.

Apps running in the same process group may not respond, but you may want to set processes=1 for more consistent failure behaviour(verify)

You may see a "segmentation fault" (or something similarly serious) in the logs, or nothing (in some cases, LogLevel info in apache config may yield more helpful information).

Possible causes (many unverified)

  • it has been associated with warnings -- though I don't yet understand how
    • fixable by ignoring the relevant (or all) warnings?(verify)
    • ...or asking for warnings to be thrown as errors?(verify)
  • C extensions that weren't written to behave well in subinterpreters
  • mixing python versions (sometimes implied by system python and a different one pulled in by mod_wsgi)
  • incompatible shared library versions (between mod_wsgi, apache, php, others?)
    • e.g. PHP using the non-reentrant version of the mysql lib, and python using the re-entrant one
  • mod_python conflict


You may get half a workaround in switching to embedded mode(verify)


Fatal Python error: Couldn't create autoTLSkey mapping


TLS in this context means Thread Local Storage (not Transport Layer Security).


Apparently, when you do a process fork (via subprocess or some other way) from a threaded program (possibly particularly for threads not created by python itself?), there is some post-fork cleanup in python itself that seems to say this.

It seems to be somewhat specific to python 2.7, and seems fixed since 2.7.3.


While it is a python error, it is only brought out by a few things, mod_wsgi being one of them (may be related to subinterpreters?(verify) if so, then using WSGIApplicationGroup %{GLOBAL} may help(verify)).



I haven't really cared to figure this out yet. Do tell me if you do.



Unsorted

Notes:

  • Remember that the code will run as whatever user the server (e.g. apache) runs as.
consider e.g. filesystem access
  • Given the WSGI ability for multithreading, applications must be re-entrant (they may be executed concurrently), so request state should be kept within that application, and global/external data should be treated as shared -- and probably locked to avoid clobbering, racing, and such.
  • Given apache's varying process models (MPMs), data that should be synced between all applications on a host is ideally handed around externally (database, memcache, maybe shared memory), or you'll probably be counting too much on behaviour of a specific MPM and possibly incompatible with others.
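As a sketch of that locking point: per-process shared state guarded by a lock. (Note this only protects within one process; across apache's multiple processes you need the external stores mentioned above.)

```python
import threading

# Module-level state shared by all handler threads in this process.
_counter = 0
_lock = threading.Lock()

def count_request():
    """Increment and return the per-process request count, thread-safely."""
    global _counter
    with _lock:  # avoid races between concurrent handler threads
        _counter += 1
        return _counter
```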


Memory leaks

I had one set of processes go berserk occasionally.

Hints:

  • use separate daemon processes where possible, and use name= to identify them in top.
(e.g. my largest one is a large but perfectly valid reverse index)
I noticed it was my "the rest, the little stuff" group that was the issue; munin showed it was doing the constant-growth-eventually-getting-oomkilled thing.
  • See if it's a module that isn't thread-safe, or doesn't understand threading or mod_wsgi's use of subinterpreters (both of which are good for efficient and isolated handlers)
Mostly, some modules weren't written with these in mind. Sometimes it leads to nonsense behaviour (shared state and all), sometimes to racing that brings out real issues.
To see if this is the problem, check whether running the code only in the main interpreter (and not in a subinterpreter) fixes it:
WSGIApplicationGroup %{GLOBAL}
In my case it did, which tells me my options are roughly:
    • figure out which module/code is causing this and isolate it (e.g. use the main interpreter for only it)
    • leave it like this -- slightly less efficient request handling, but not bad either, and for my miscellaneous stuff it's fine.
  • there's nothing like good ol' logging
if you have half an idea what code it is, basic 'at this time I started X', 'at this time it finished and took this long' type logging
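A minimal sketch of that kind of timing log (the logger name is arbitrary):

```python
import logging
import time
from functools import wraps

log = logging.getLogger('myapp.timing')

def timed(func):
    """Log when func starts, and how long it took (even if it raises)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        log.info('started %s', func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            log.info('finished %s in %.3fs', func.__name__, time.time() - start)
    return wrapper
```

Decorate the suspect handlers with @timed, correlate the log with the memory graph, and narrow it down from there.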



"Unable to connect to WSGI daemon process 'name' on '/var/run/something.sock' after multiple attempts as listener backlog limit was exceeded."=

...and probably a 503 Service Unavailable at the browser end.


All of the daemon workers are busy, we are part of a larger queue waiting on them, and after a bunch of time did not get a worker.

This, in any language or server, means your code is taking longer than it should for the request rate you are getting.

The higher the request rate you want to support, the more important it is that you never do any serious work in the request handler itself.

If you're doing hard calculation, offload that to the background (e.g. celery tends to make this simpler).

If it's not CPU time, look to locks, IO, thrashing, etc.
