Mod python notes

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.
This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).


I have moved to WSGI for my python webdev so will not update this page anymore.

Also, since I never really cleaned most of this to my satisfaction, don't believe everything you read.


Quick intro

mod_python is an apache module that hosts python interpreters, which has some fairly unobtrusive glue layer to web functionality.

Unobtrusive also meaning 'you have to do a lot yourself', and its minimalism is also the source of some peculiarities; see e.g. the 'mod_python and ...' sections in here.

If you want a high-level platform you may want to do a thorough comparison with turbogears, django, cherrypy, and/or go the more portable WSGI way (...though note that in WSGI it's harder to stream out data without breaking the WSGI model a bit).

mod_python as a project is now dead; I suggest you go the way of WSGI instead.

Forms, cookies, headers, etc.

Note the -like in dict-like

The form object, input headers, output headers, and a few others act like dicts for convenience, and are usually mp_table objects. These wrap apache tables, which are actually (string-string) maps that allow duplicate entries, which can be necessary e.g. in the case of cookies.

Note that these things have case-insensitive key lookups.

From the python side, the duplicate thing means that a get may return either a string or a list of strings. If you want to set exactly one value and overwrite anything previous, you can assign directly. If you want to nondestructively keep a list, you should use .add().
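The duplicate-key behaviour can be illustrated without mod_python. The following is a minimal stand-in written for these notes (the class name MultiTable is made up, and this is not how mp_table is implemented); it only mimics the rules described above: case-insensitive keys, assignment overwrites, add() appends.

```python
class MultiTable(object):
    """Hypothetical stand-in for an mp_table-like object (illustration only)."""

    def __init__(self):
        self._data = {}  # lowercased key -> list of values

    def __setitem__(self, key, value):
        # Plain assignment replaces whatever was stored under this key.
        self._data[key.lower()] = [value]

    def add(self, key, value):
        # add() keeps earlier values and appends the new one.
        self._data.setdefault(key.lower(), []).append(value)

    def __getitem__(self, key):
        values = self._data[key.lower()]
        # One value comes back as a string, several as a list of strings.
        return values[0] if len(values) == 1 else list(values)


table = MultiTable()
table['Set-Cookie'] = 'a=1'     # exactly one value
table.add('set-cookie', 'b=2')  # now two values under the same key
```

Code that reads such a table therefore has to cope with both a string and a list coming back from the same key.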


Request headers

The header objects are:

  • req.headers_in
  • req.headers_out
  • req.err_headers_out

headers_out is different from err_headers_out in that the latter will get sent instead of the former when you cause an apache error return. This makes it easy to, for example, avoid sending out cookies that you had previously set when you later discover you need to bomb out and send an error page.


Example:

 import datetime

 #set Expires one day ahead (according to server time)
 #HTTP dates must name their timezone as GMT, not UTC
 req.headers_out['Expires'] = (
   datetime.datetime.utcnow()+
   datetime.timedelta( days=1 )
 ).strftime('%a, %d %b %Y %H:%M:%S GMT')
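As an alternative to hand-written strftime patterns, the standard library can produce the HTTP date format directly. This is a small sketch; the helper name http_date and its parameters are made up for these notes.

```python
import time
from email.utils import formatdate

def http_date(seconds_ahead=0, now=None):
    """Return an HTTP-style date string (always in GMT) for use in headers."""
    if now is None:
        now = time.time()
    # usegmt=True makes the string end in 'GMT' rather than '-0000'.
    return formatdate(now + seconds_ahead, usegmt=True)

expires_value = http_date(24 * 60 * 60)  # one day ahead of server time
```

In a handler this would be used as req.headers_out['Expires'] = http_date(86400).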

Forms: GET/POST values, files

The publisher unconditionally parses form values, and puts the results on the req object. I don't like the publisher, so I have to do it myself:

form = mod_python.util.FieldStorage(req,keep_blank_values=1)

form is an instance of mod_python.util.FieldStorage, which is also a dict-like object. When you use it as one to get a named form element, you get back an object that is one of:

  • a StringField instance, which is a thin wrapper around the built-in str type
  • a Field instance, for file uploads (see below)
  • a list, if you had multiple things with the same name - which is fairly normal in forms

There are three ways you can get something from a form, mostly alternatives of code convenience.

#It's possible to do:
firstname = form['firstname']  # gives an object or list as mentioned, 
                               # raises KeyError if not present in form

#Get the first of one or of many:
firstname = form.getfirst('firstname')   #returns None if not present in form

#Get/use as a list 
for firstname in form.getlist('firstname'):   #returns [] if not present in form
    pass  #do stuff.

I suggest the latter two as they keep your code simpler, and are still robust to getting a different number of values than you may strictly be expecting.
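If you do end up with a value from direct indexing (or from code you don't control), the 'single object, list, or nothing' result can be normalised with a small helper. This is a sketch; as_list is a name made up for these notes, not part of mod_python.

```python
def as_list(value):
    """Normalise a 'single value, list, or None' lookup result to a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]
```

After that, the calling code can always loop over the result, whatever number of values the browser sent.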


File uploads

As mentioned, the returned value will be of type mod_python.util.Field instead of a StringField. You could assume the form member that should be file input actually is, or test it e.g. like:

uploaded = form.getfirst('upload')
if isinstance(uploaded, mod_python.util.Field):  #and not a StringField (or None)
    pass  #do stuff.

If you are not seeing any Field objects when you think there should be, this probably means the form's encoding type and/or HTTP method is not properly set in the HTML (file uploads need method="post" and enctype="multipart/form-data").


mod_python.util.Field has the following members:

  • file: A file-like object, already open. (For small uploads this will be a StringIO object, for larger uploads it's a TemporaryFile)
  • value: file contents - at time of (first) access, the entire file will be read into memory (which may be a bad idea resourcewise!) to back this. If you just want to write away the file, use .file to stay memory-lean.
  • members that come from the browser:
    • name: form input element's name
    • filename: the filename
    • type and type_options: The Content-Type (split into MIME type, and options like charset)
    • disposition and disposition_options: The Content-Disposition

I'm not sure what the best way to get the size is, but I suspect it may be the old trick that is:

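import os

formfile = uploaded.file          # the already-open file-like object
formfile.seek(0, os.SEEK_END)
size=formfile.tell()
formfile.seek(0)                  # rewind, so later reads start at the beginning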
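The same seek/tell trick can be wrapped so that it leaves the read position untouched. This is a sketch using an in-memory stream as a stand-in for an upload's .file member; stream_size is a made-up name.

```python
import io
import os

def stream_size(fileobj):
    """Return the total size in bytes of a seekable file-like object,
    leaving the read position where it was."""
    position = fileobj.tell()
    fileobj.seek(0, os.SEEK_END)   # jump to the end...
    size = fileobj.tell()          # ...where the offset equals the size
    fileobj.seek(position)         # restore the caller's position
    return size

demo = io.BytesIO(b'hello world')  # stand-in for an uploaded file
demo.read(5)
size = stream_size(demo)
```

This only works on seekable objects, which both in-memory and temporary files are.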

Cookies

Helper functions to ease sending and parsing cookies. See e.g. these examples.

Note that this interface is a little different from that in the standard python library. For example, the cookie object here represents a single cookie, while in the standard library single cookies are called Morsels and the cookie object is a collection of them.
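For comparison, this is roughly what the standard library side looks like (a sketch; the cookie names and values are arbitrary). The import differs between Python 2 and 3, hence the try/except.

```python
try:
    from http.cookies import SimpleCookie, Morsel   # Python 3
except ImportError:
    from Cookie import SimpleCookie, Morsel         # Python 2

jar = SimpleCookie()
jar.load('thing=seen; other=1')   # parse the value of a Cookie: header

morsel = jar['thing']             # each entry in the collection is a Morsel
thing_value = morsel.value
```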


A simple example: Import the cookie-handling module:

from mod_python import Cookie

Parsing from a request, and getting a cookie from this object:

cookies = Cookie.get_cookies(req)
if 'thing' in cookies:
    thing=cookies['thing'].value

(it could be that .get() can handle double cookies better, (verify))


Adding a cookie to a response:

thingcookie=Cookie.Cookie('thing', 'I saw a thing. Thing!')
thingcookie.expires = time.time()+10800  #in seconds. Reformatted to the date format cookies use
Cookie.add_cookie(req, thingcookie)

(This may also avoid some trouble with double headers when you set multiple cookies(verify))


Mod_python also allows marshalling into cookies. The examples become slightly longer with that.

Sessions

Simple cookie-identified sessions are supported. Whether they are memory-only or filesystem-backed (DBM, but memory mapped) depends on the apache MPM; see this.

DBM access is sequential, which is why forked MPMs will allow it but threaded will not, and why it will be very slow if all requests have to wait for others to stop accessing it. On a remotely serious site you should back sessions by a real database. (There is probably a library out there that helps you do that)

DBM files will live between apache restarts but be stored in a temporary directory so not guaranteed to stay alive. The directory can be changed, though, so on simple sites and when using DbmSession, you can get strongly persistent sessions.


Other

See mod_python.util documentation.

Client IP

To get the client's IP:

strIP = req.get_remote_host(apache.REMOTE_NOLOOKUP)

I prefer to get the IP instead of a hostname-or-IP-if-that-fails and avoid a potentially slow DNS lookup, hence the REMOTE_NOLOOKUP.

Self-reference URL

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

mod_python design/configuration

Note: mod_python compiles in an interpreter, and does not use the one on your system. You may well get 2.3 in mod_python while you have 2.5 as your /usr/bin/python. Libraries and site packages still come from disk. (verify) whether I'm full of it here.


Apache: Phases and Handlers

Mod_python integrates into apache's model more than many other modules do, so you also play by its rules more, both in terms of tight integration and in complexity. See this, and possibly also the bottom of Apache config and .htaccess, and note that the exact way the concurrency works on mod_python depends on the apache MPM you are using.


You can replace the default handler or add to others, using SetHandler and AddHandler. AddHandler adds a handler for an extension, while SetHandler sets a handler regardless of extension, so in cases like the publisher it will try to interpret everything as python, and with python-program it will always use the handler script. (verify)

Mod_python can be hooked into various phases. You usually add the main phase with PythonHandler, but can hook into others by setting its friends (for the list, see the links below), to a particular callable object to run. For example, to add a handler for the main phase and for the authen phase:

AuthType Basic
AuthName "Restricted Area"
Require valid-user

# (apache comments must be on their own line, not after a directive)
# add mod_python as handler for the .py extension
AddHandler mod_python .py
# script to use for the main phase
PythonHandler       myscript
# script to use for the authen phase
PythonAuthenHandler myscript

This implicitly replaces apache's default behaviour of using its own, .htpasswd-based authentication handler. Instead, a function inside myscript is called. (Because of the way the bare handler resolves handler names, two different functions inside the same script can be called.)


Dispatchers

The way that code is handled depends on the dispatcher you configure in apache.

Options include: (more detailed notes in sections below)

  • python-program (a.k.a. mod_python, somewhat confusingly), is the lowest-fluff dispatcher. I'll call it the bare handler. It has a somewhat odd way of handling URLs (mostly just leaving it all up to you)
  • There is publisher, which adds some convenience, a different status return scheme, and a URL scheme that is almost even stranger.
  • cgihandler imitates what python as bare CGI would do (except faster)
  • PSP is a template-ish parser a la ASP, PHP, etc.
  • Spyce (third-party) is another.
  • vampire, which adds rather more sensible URL mapping to various of the above handlers, and some minor web-frameworkey details.


Random benchmark figures copy-pasted from the docs, as some indication of the turnaround overhead of a hello-world handler:

mod_python handler (a.k.a python-program):     1203 requests/sec
mod_python publisher:                           476 requests/sec
mod_python cgihandler:                          385 requests/sec
standard CGI:                                    23 requests/sec


In my not so humble opinion...

If your choice is between the bare handler and publisher, don't choose publisher. It gives you too few extra features (various parts are more different than they are easier or better) for the speed drop. Also, since the documentation was written for the bare handler, it is often not easy to see how something applies to publisher.

Both the bare handler and the publisher have odd URL handling, though. Mostly for this reason, my personal preference is using something like vampire, around the bare handler. This yields what I consider a fair balance of convenience and speed.


cgihandler is perhaps simplest to understand - mostly just a script that is executed for each request. While faster than oldschool CGI, it's certainly no FastCGI / WSGI (You may want to look into CherryPy if you like that idea.).

PSP may be useful to you for the same reasons PHP, ASP, and such are: templates, simple to use, a little less control. I didn't like its syntax/behaviour, but you may.


mod_python, a.k.a. python-program

The barest method you can get at. This tends to be the one used in documentation and often in example code too. (It has been renamed mod_python since 3.1.3. python-program works in all versions, but will eventually be phased out.)


Add the following to your apache config (central, or .htaccess assuming you have the FileInfo override):

AddHandler mod_python .py
PythonHandler mptest

Example code (mptest.py):

from mod_python import apache
def handler(req):
    req.content_type = 'text/plain'
    req.write('Hello world from mptest.py\n')
    return apache.OK

The handler must be called handler (for the main phase, authenhandler for a PythonAuthenHandler, etc.).


This does not do URL handling: for any request in this directory ending with .py, this imports mptest and calls this specific handler (for the corresponding phase, here the main one), which is probably not what you expect. Any varying response to different URLs has to be handled by you.
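A minimal sketch of doing that URL handling yourself inside a single bare handler follows. The routing table and function names are made up, and the two status constants are stand-ins so the example runs without mod_python; in a real handler you would use apache.OK and apache.HTTP_NOT_FOUND. It assumes req.uri holds the path part of the requested URL.

```python
# Stand-in values for illustration; use apache.OK / apache.HTTP_NOT_FOUND
# from mod_python in a real handler.
OK = 0
HTTP_NOT_FOUND = 404

def show_index(req):
    req.write('index page\n')
    return OK

def show_about(req):
    req.write('about page\n')
    return OK

# Hypothetical routing table: last path component -> function.
ROUTES = {
    'index.py': show_index,
    'about.py': show_about,
}

def handler(req):
    # Pick the function based on the last part of the requested path.
    name = req.uri.rsplit('/', 1)[-1]
    target = ROUTES.get(name)
    if target is None:
        return HTTP_NOT_FOUND
    req.content_type = 'text/plain'
    return target(req)
```

This stays inside the 'everything goes through one handler' model; it just makes the dispatching explicit.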

You can work around this in several ways, including some hacky ways like this, arguably the publisher, or rather more cleanly with vampire.

python-publisher

Builds on the bare handler, adding a little convenience and changing some behaviour. The most noticeable differences seem to be:

  • different URL-to-.py mapping: you can use separate .py scripts, and you can get to specific functions in each script.
  • It uses URLs in a way that makes it very easy to break relative links
  • Has a different return mechanism, and handles statuses/exceptions differently
  • parses form information automatically (this is a single extra line in the bare handler)

Apache:

AddHandler mod_python .py
PythonHandler mod_python.publisher

Some script, for example something.py:

from mod_python import apache
def aFunction(req):
    req.content_type = 'text/plain'
    req.write('Hello world')
    return ' from something.py\n'     #You can both req.write and return

...which you would get to via http://server.name.com/somedir/something.py/aFunction.

Note that functions beginning with an underscore will not be exposed by the publisher.

Before mp 3.2.6, returning None (explicitly, or implicitly by not returning anything) or an empty string from the dispatcher handler function yielded a '500 Internal server error'. Since 3.2.6 it produces an empty page.

Vampire

Vampire is an extension that wraps around either of the above(verify) but primarily the bare handler. It:

  • has more natural URL mapping than both
  • allows multiple handlers per directory, and in fact per file
  • does a few more simple web-platform things (see [1]) (though nothing as complex as framework functionality)

Apache:

SetHandler python-program
# hand control to this callable object
PythonHandler vampire

At this point, pointing your browser to:

  • index.html will map internally to index.py, function handler_html()
  • fish.txt will map internally to fish.py, function handler_txt()
  • fish will map internally to fish.py, function handler()

(the handlers otherwise act like the bare-style handlers)
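The mapping in the list above can be written down as a small function. This is only an illustration of the naming rule as described here, not vampire's own code; the function name is made up.

```python
import posixpath

def vampire_style_target(path):
    """Map a URL path to (module name, function name) following the
    naming rule described above. Illustration only."""
    filename = posixpath.basename(path)
    stem, extension = posixpath.splitext(filename)
    if extension:
        # 'fish.txt' -> fish.py, handler_txt()
        return stem, 'handler_' + extension[1:]
    # 'fish' -> fish.py, handler()
    return stem, 'handler'
```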

Vampire falls back to file serving: if you ask for fish.txt and there is no handler_txt in fish.py, and there is a file called fish.txt, that file will be returned via apache (verify). If a directory with the name exists, it will be viewed using standard vampire semantics(verify).

You may also want to add PythonOption VampireDirectoryIndex index.html, allowing you to use URLs that end in a slash (assuming you add handlers to go along). Note that unlike Apache's DirectoryIndex (which doesn't apply because these are virtual URLs) this option wants one value, not a list.

Others: python-serverpage, cgihandler, spyce

PSP:

AddHandler psp-handler .psp
PythonHandler mod_python.psp

Frankly, if you want templating, use a friendlier alternative such as turbogears's kid.


cgihandler:

SetHandler mod_python
PythonHandler mod_python.cgihandler

This emulates a very basic "I get data, I put data" run-per-request CGI interface. Other than the fact it's slightly less heavy than a simple CGI setup (and potentially noticeably faster), it seems of limited value to me, because that speed difference doesn't necessarily justify a decent-sized dependency like mod_python.

It doesn't have the 'modules kept loaded don't reload on import' feature/bug, more or less because of its load-per-request nature (more cgi than fcgi).


spyce:

PythonHandler spyce.run_spyceModpy::spyceMain

...but see the project for actual details.

Returning/raising HTTP states and errors

Note: The following applies primarily to the python-program handler, because:

  • in the bare handler you can both return the errors and raise them
  • in publisher you can only raise them

...so if you think you'll use more than one handler over time, you should probably learn raising. It's also arguably more convenient as it allows you to bomb out from functions called by the handler without the cooperation of the main handler.


Apache behaves in different ways depending on what you return to it. There are some special cases that are not HTTP constants:

  • apache.DECLINED: Fall back on default apache handler (for any phase?(verify))
  • apache.DONE: Stop now, skip further phases (useful for your own error handlers)
  • apache.OK: an 'all is well unless indicated otherwise' meaning:
    • in the authen phase, OK means "yes, authenticate them".
    • In the main phase it will eventually lead to whatever status was set (may be 200, and it may be something else), if all else goes well.

The rest are HTTP status constants, such as HTTP_MOVED_TEMPORARILY (302), HTTP_NOT_MODIFIED (304), HTTP_FORBIDDEN (403) and whatnot, see this list.

The following two are the return and raise equivalents of a HTTP_FORBIDDEN response:

return apache.HTTP_FORBIDDEN
 
raise apache.SERVER_RETURN, apache.HTTP_FORBIDDEN
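Why raising is convenient shows best when the decision is made a few calls deep. The sketch below uses stand-ins (a local SERVER_RETURN class and numeric constants) so it runs without mod_python; run() only approximates what the calling side does with the raised value. The call form raise SERVER_RETURN(status) means the same as the Python 2 comma form used above.

```python
# Stand-ins for illustration; mod_python provides apache.SERVER_RETURN,
# apache.OK and apache.HTTP_FORBIDDEN.
class SERVER_RETURN(Exception):
    pass

OK = 0
HTTP_FORBIDDEN = 403

def require_admin(user):
    # A helper deep in the call chain can end the request on its own.
    if user != 'admin':
        raise SERVER_RETURN(HTTP_FORBIDDEN)

def handler(user):
    require_admin(user)   # no status checking needed at this level
    return OK

def run(user):
    """Roughly what the caller does: treat the raised value as the status."""
    try:
        return handler(user)
    except SERVER_RETURN as exc:
        return exc.args[0]
```

With return-style statuses, handler() would instead have to check the result of every helper and pass it on.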


Raising non-error codes will do what is sensible. For example, you'll likely see the following in cache-aware/ETag-using apps:

raise apache.SERVER_RETURN, apache.HTTP_NOT_MODIFIED


You can also control whether you want apache to serve its ErrorDocuments (skipping all further handlers and phases?(verify)):

raise apache.SERVER_RETURN, apache.HTTP_UNAUTHORIZED

...or write your own error content (e.g. for a 404), while the error will still get logged as such and treated by the browser as such. Example:

 req.status = apache.HTTP_UNAUTHORIZED
 # req.write your own content
 raise apache.SERVER_RETURN, apache.DONE

Note that this example uses apache.DONE to skip further phases. In other cases you may wish to use apache.OK.


Redirects

External redirects (browser redirects) are a combination of a return header and a return status:

req.err_headers_out.add("Location", 'http://www.modpython.org/')
raise apache.SERVER_RETURN, apache.HTTP_MOVED_PERMANENTLY


HTTP_MOVED_PERMANENTLY (301) and HTTP_MOVED_TEMPORARILY (302) are most common, though you could also use HTTP_SEE_OTHER (303) or HTTP_TEMPORARY_REDIRECT (307); see HTTP_notes#HTTP_redirect and various online sources.

To do an apache-internal redirect, see req.internal_redirect(new_uri).

Debugging

Stderr goes to the apache error log, stdout doesn't go anywhere we can see. You can redirect both to the browser if you want to, but they are buffered and therefore subject to the flushing decisions of the (sub)interpreter that the error was generated in, which makes that mildly impractical.


Often enough:

PythonDebug On

will do. This causes mod_python to print stack traces in the return document instead of only in the error log.


There is also a way of using pdb, an interactive python debugger. See [2].


through logging

You can also avoid the buffering problem by doing print-style debugging directly into the apache logs:

apache.log_error("I will show up in your error log.") 
# or 
req.log_error("I will show up in your error log.")

There is a difference in where the string goes. My guess is that this is because the req object sits in the context of the virtual server, and the apache module does not.


It may be convenient to log into a database table, as you can filter, make timestamps more readable (with details like '3 seconds ago') though it will obviously not work for a few problems, such as database connection problems.

Mod_python and apache

(Sub)interpreter isolation

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Each apache child process has one main python interpreter, but what you actually use are subinterpreters. Subinterpreters are part of the same process as that main interpreter, but each one serves a particular context (a vhost by default, possibly a directory if configured that way). They are isolated from each other in terms of global state, module state and such, though the GIL is shared between all interpreters in a single process.


A child handles one request at a time, purely sequentially. Apache hands off requests to whatever child isn't busy at the time, and may create extra children if required, within configured limits.

A child and the implied python interpreter started in it is long-lived. An interpreter is stopped only when the apache child is killed, for apache reasons such as that the child has handled a configured amount of requests (usually (tens of) thousands), or that this was an extra child created for temporary large load, and is unnecessary for the current load.


Each apache child can handle all virtual hosts (if you use them), because a sub-interpreter for it can easily be created in each (note that subinterpreters are reused for requests in the same context).

The way mod_python decides to create a subinterpreter, which also has implications to the degree in which state is isolated or not, can be set to work one of three ways:

  • per vhost (default)


There is also the PythonInterpreter directive, which you can set to force multiple vhosts to share the same interpreter, which can be useful if you must have shared state, but is generally not an interesting setting. Note this only works within and not between apache children.



MPM-specific behaviour

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

See also Apache config and .htaccess#MPMs

General notes:

  • never rely on keeping things around in globals. You can do this, but should know all the limitations.


To mod_python, the important difference between MPMs is whether it is threaded (e.g. worker) or not (e.g. prefork).


Documentation often describes the workings of non-threaded MPMs like prefork, in which a single child processes a single request at a time, and if you use mod_python there is one interpreter for it.


Threaded MPMs like worker have a number of threads per process, which means that apache may delegate different requests to multiple threads inside the same process.

Threading extends into mod_python, by design(verify). There is also an administrative delegation layer within mod_python, in that you can tell it to create subinterpreters per something.

Still, that means that multiple threads (working within the same subinterpreter) could be executing the same handler at the same time(verify).


Since threads share globals, this has implications to concurrent data access, but writing handlers that intentionally share global state can be handy (and easy enough, with a few locks or such). It also has implications on modules that are not thread-safe (much harder to fix).
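A sketch of what 'a few locks' means in practice: a module-level counter that several threads update. The names are made up, and the threads here merely simulate concurrent requests in one (sub)interpreter.

```python
import threading

hit_count = 0                  # shared by every thread in this interpreter
hit_lock = threading.Lock()

def count_hit():
    """Increment the shared counter; the lock keeps the update atomic."""
    global hit_count
    with hit_lock:
        hit_count += 1

def simulate(threads=8, hits_per_thread=1000):
    """Call count_hit() from several threads at once and return the total."""
    def worker():
        for _ in range(hits_per_thread):
            count_hit()
    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return hit_count
```

Note that such a counter is still local to one process (and one subinterpreter), so it is not a site-wide total.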

If you require shared data to be singular, regardless of MPM in use, you often require some external communication, whether that is via IPC, (transactioned) shared memory, a memcache, database storage, disk, or some other method.



mod_python and Unicode

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Mod_python's unicode support is, in a word, missing. This is perhaps no big deal, as you need to know that HTTP is byte based, and need to know what you're doing anyway.


No default (en/trans)coding

You can of course use Unicode data, but in the design of mod_python, there is no default choice of how data is sent out. This means that byte strings send fine (though without a charset even their meaning isn't too well defined; RFC2616 specifies 'recipient should guess'), but Unicode strings do not.

More specifically, any implicit conversions fall back to use python's site.py settings, in which the codec is usually set to 'ascii' (which is the reason python generally complains about printing anything that contains more than ascii).
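The failure mode is easy to reproduce outside of mod_python. This sketch does explicitly what an implicit conversion with the 'ascii' codec amounts to:

```python
text = u'\u2222'  # a character outside ASCII

try:
    text.encode('ascii')           # what an implicit conversion boils down to
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True            # this is the error you would run into

utf8_bytes = text.encode('utf-8')  # an explicit choice of encoding works fine
```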

Editing site.py to get mod_python to behave is not a satisfying solution, because it's playing with global python behaviour, and since you won't always have control over the file, it's not a portable one either.


mod_python output

In terms of output, Unicode support generally consists of informing the browser of the encoding you use, then sending all strings encoded that way. (That is, mentioning the charset in the response, and writing out response data in that format)

Given that the only encodings that really work well on Unicode are the UTFs (and GB18030), I would call UTF8 or UTF16 quite sane defaults to use.


Headers

You inform the browser via req.content_type, after the MIME type. Code should strictly check and stick to the browser's preferences in the Accept-Charset header, but UTF8 is almost guaranteed to be there, so you can usually be a little lazy and just decide to use, for example:

req.content_type='text/html;charset=utf-8'

Note that it should be utf-8, not utf8, and not UTF-8 either(verify). Some browsers are more likely than others to trip over this.


Alternative ways of writing

Transparent conversion in req.write (personal preference)

You can assign req.write as a function that wraps the original req.write, and re-codes any unicode string it meets.

This means that any unicode strings you write out are automatically transcoded as UTF-8, and bytestrings (already coded) are left untouched, which seems a nice and simple model.


My own code is currently in the form of a decorator, which I prefer because it's short and centralizes the fixing code (and means I can omit it when I don't want it):

@common.uniwriter()
def handler(req):
    req.write(u'\u2222')
    #etc.

Or, for another content type:

@common.uniwriter(mime='application/rss+xml; charset=utf-8')
def handler(req):
    #etc.


The supporting code could look something like:

#convenience constants
html8 = 'text/html; charset=utf-8'
text8 = 'text/plain; charset=utf-8'
txt8  = 'text/plain; charset=utf-8'
xml8  = 'text/xml; charset=utf-8'
# (note that 'utf8', without the dash, is incorrect, though some browsers accept it)

def u8write(req,s):
    " Outputs bytestrings as-is, unicode as UTF8. Used in the decorator definition below. "
    if type(s)==str:
        req.oldwrite(s)
    elif type(s)==unicode:
        req.oldwrite(s.encode('utf8'))
    else:
        raise TypeError('req.write only takes strings')
        #While debugging, it may be convenient to use: 
        #req.oldwrite(repr(s)) #...for var_dump-like functionality

def uniwriter(mime=html8):
    """ Puts req.write in a wrapper function to encode unicode as utf8,
        Sets mime type to go along (HTML by default)            
    """
    def decorator(func): 
        def wrapper(req):
            req.content_type=mime   # Set the MIME type, with charset
            req.oldwrite=req.write  # keep reference to the actual writer around
                                    # (u8write uses it, and you may sometimes want to)
            req.write = lambda s: u8write(req,s) # replace req.write with our wrapper
            return func(req)
        return wrapper
    return decorator

Collecting strings

You can keep a list of strings and eventually do a join-encode-req.write().

This can be preferable when you want slightly more portable code, as it takes little effort to drop this into frameworks that expect handlers to return the result document text (and do not allow streaming bits out as they are produced).


Specifically choosing the list-of-strings approach (over string appending) is for efficiency: because strings are immutable, concatenating bits onto a string with + or += continually creates new objects, so is inefficient when you handle large documents made from many small parts (see string concatenation alternatives).

So, for example:

def handler(req):
    req.content_type='text/html;charset=utf-8'
    out=[] #start an array  
     
    #append (unicode) strings to it
    out.append('<html><body>')
    out.append(u'\u2222')
    out.append('</body></html>')
     
    #write it all out in one big chunk 
    req.write( ''.join(out).encode('utf8') )


Explicitly add .encode('utf8') to every req.write()

One way is to explicitly .encode('utf8') all strings you hand to req.write, but I guarantee you'll occasionally forget it, and/or mess up the interaction with cgi.escape() or urllib.quote().

It's cluttery, verbose, redundant, and error-prone: You could accidentally mix unicode and utf8 data by forgetting to encode something.


Object model

Build up your page in an object model that does all encoding for you.

This can be quite elegant when it applies to your uses, but it tends to be bothersome in practice, as you may end up doing a lot of work just trying to fit in things to a specific model's way of doing things, particularly when you have dynamic content.

escaping, url encoding

When building up a page, you will likely also regularly use PCDATA escaping and attribute escaping, as in e.g.:

newthing = "the <new> thing"
'<a href="http://somewhere/s=%s">%s</a>'%(urllib.quote(newthing),cgi.escape(newthing))

will result in:

'<a href="http://somewhere/s=the%20%3Cnew%3E%20thing">the &lt;new&gt; thing</a>'


The urllib.quote() function became stricter in python 2.4.2 and will not pass through unicode. This is closer to specs, but it means you have to wrap it yourself, for example applying .encode('utf8') before you quote. (The same goes for urllib.urlencode) You can make this a little less tedious by using:

def utf8quote(s):
    " returns  urllib.quote(s.encode('utf8')) "
    return urllib.quote(s.encode('utf8'))
 
def utf8dictquote(d):
    " Acts like urllib.urlencode (url encode for dict) but encodes vars and values as utf8 "
    parts=[]
    for var in d:
        val=d[var]
        parts.append( '%s=%s'%(utf8quote(var),utf8quote(val)) )
    return '&'.join(parts)

Input

Decoding form input

There is no standard for unicode strings in URLs. The usual application convention is to UTF8-encode, then url-escape it.

However, when you paste unicode into the location bar on a URL that the browser doesn't know, most default to use codepage 1252 a.k.a. windows-1252 for backwards compatibility reasons. (See also common character sets)

I like to be safe and be robust to both cases.


The following function also eases sanitation somewhat:

def getfirst_unicode(form,var,absent=None):
    '''
       A utf8/cp1252-decoding version of form.getfirst(). 
       Almost all cp1252 is invalid as utf8, so cp1252 is the fallback.
       
       If var is not in the form, it returns the absent value 
       (By default this is None. It may be convenient to use u'' or even 0)
    '''
    s=form.getfirst(var)
    if s is None:
        return absent
    try:
        s=s.decode('utf8') 
        #standard behaviour is to raise the following exception on bad bytes:
    except UnicodeDecodeError:
        s=s.decode('cp1252','replace') 
        # 'replace' because python's cp1252 codec leaves five byte values
        # (0x81, 0x8d, 0x8f, 0x90, 0x9d) undefined and would raise on those.
    return s

(getfirst because that's what I use in practice. getlist_unicode and such can be written analogously)
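A list variant could look like the sketch below. The decoding step is split off into its own function so that both variants can share it; both names are made up, and form is assumed to offer getlist() returning byte strings, as described earlier.

```python
def decode_form_bytes(raw):
    """Decode bytes as UTF-8, falling back to cp1252 for anything invalid."""
    try:
        return raw.decode('utf8')
    except UnicodeDecodeError:
        # 'replace' covers the few byte values cp1252 leaves undefined.
        return raw.decode('cp1252', 'replace')

def getlist_unicode(form, var):
    """List counterpart of getfirst_unicode; returns [] when var is absent."""
    return [decode_form_bytes(value) for value in form.getlist(var)]
```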


The point of the absent parameter is that you can have a default value in case it was not supplied. In other words:

 int(getfirst_unicode(form,'amount',1))

...replaces code you would otherwise have to write as something like:

amount=getfirst_unicode(form,'amount')
if amount==None:
    amount=1
amount=int(amount)

This doesn't catch the case of the conversion borking (getting non-number input). This could also be handled in the function, but wishes will vary - if you get such mangled data, do you want a default value instead? A value representing 'bad'? A proper error/exception after all?

Coders may want to write fixes around specific cases (e.g. go to defaults), and/or wrap a simple try: ... except ValueError: around all and display a "You did a bad thing" page.
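
As a minimal sketch of the 'go to defaults' option: the conversion can be pushed into a small helper. The name to_int is made up for this example (it is not part of mod_python), and treating mangled input the same as absent input is just one of the policies mentioned above.

```python
def to_int(value, default=None):
    """Convert form text to int; return default when absent or not a number.

    to_int is a hypothetical helper, not something mod_python provides.
    """
    if value is None:
        return default
    try:
        return int(value)
    except (TypeError, ValueError):
        # int() raises ValueError for text such as 'abc' or '', and
        # TypeError for objects it cannot convert at all (e.g. a list).
        return default
```

With the function above it would be used as to_int(getfirst_unicode(form,'amount'), 1). If you want an error page instead, let the exception through and catch it in one place.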


The error robustness behaviour can be argued about.

  • .decode('utf8') is strict so may raise on invalid byte sequences
  • .decode('utf8','ignore') doesn't, but eats bytes that didn't end up being decoded which may be unrelated bytes (in the worst case a < in xml, invalidating XML structure because of invalid contained text)
  • chardet is a little heavy for just string input, and is not too useful on very short strings anyway
  • .decode('cp1252') raises only on the five byte values that codec leaves undefined (latin1 never raises), but the result may not be correct

For this reason, I try utf8 strictly, then fall back on cp1252/latin1. I'd rather show bad characters than fail.
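
The fallback order can be checked on concrete byte strings. This is a sketch written for Python 3 (bytes in, str out), unlike the Python 2 function above; the name decode_form_bytes is made up, and using 'replace' on the cp1252 step is my assumption about what 'show bad characters rather than fail' should mean.

```python
def decode_form_bytes(raw):
    """Decode bytes as UTF-8 when valid, else as cp1252 (hypothetical helper)."""
    try:
        return raw.decode('utf8')      # strict: raises on invalid byte sequences
    except UnicodeDecodeError:
        # 'replace' turns the few bytes cp1252 leaves undefined into U+FFFD
        # instead of raising.
        return raw.decode('cp1252', 'replace')
```

A lone 0xE9 is not valid UTF-8, so it takes the cp1252 path and comes out as é; the two-byte UTF-8 form of é takes the first path.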




mod_python and importing

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

This section is known to be incomplete and may mislead you as to your options.


This was written to detail mod_python's automatic reloading, which

  • is based on .py modification time and
  • only directly applies to modules with handlers.

Those and a few other details mean the reloading may seem to act somewhat non-deterministically. The simplest way around this is a manual one, touch *.py. Automatic solutions are harder.

This and various other aspects of (mod_)python importing and reloading are detailed below.


Cause: regular python behaviour

Basic python loading

With basic Python, the first time you import a particular module, that module will be loaded and assigned to sys.modules, and (depending on the exact import statement), some name binding will occur. Imports after this will notice the module is present (in sys.modules, based on name) and have no effect beyond name binding.

That is, when you use the import statement, python will never reload an already loaded module, or even hit the disk at all. This is sane for most any program, including regular python: dealing with reloading interdependent things is ill-defined and error-prone in almost every language out there. (Programs generally do not load new code/DLL from disk unless they are stopped and started again, or unless it is strictly formalized via a plugin system or similar)

Being dynamic, python does allow you to call reload(module), which explicitly reloads the module based on the current source code on disk. However, reload() has some of the inherent problems just mentioned, and was removed as a builtin in Python 3 (it survives as importlib.reload()) because these problems are not solvable without a full (re)definition of importing semantics - and reloading is rarely necessary.
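
The import-once behaviour is easy to see outside of mod_python. The following is a sketch for Python 3, so importlib.reload() stands in for the builtin reload() discussed here; the module name cachedemo_helper and the temporary directory are made up for the demonstration.

```python
import importlib
import os
import shutil
import sys
import tempfile

def demo_import_caching():
    """Return the value seen after first import, repeated import, and reload."""
    tmpdir = tempfile.mkdtemp()
    source = os.path.join(tmpdir, 'cachedemo_helper.py')   # made-up module name
    with open(source, 'w') as f:
        f.write('VALUE = 1\n')
    sys.path.insert(0, tmpdir)
    try:
        import cachedemo_helper
        first = cachedemo_helper.VALUE

        # Change the source on disk (different length, so a cached .pyc cannot match)
        with open(source, 'w') as f:
            f.write('VALUE = 22222\n')

        import cachedemo_helper             # already in sys.modules: only binds the name
        second = cachedemo_helper.VALUE

        importlib.reload(cachedemo_helper)  # explicitly re-executes the source
        third = cachedemo_helper.VALUE
        return first, second, third
    finally:
        sys.path.remove(tmpdir)
        sys.modules.pop('cachedemo_helper', None)
        shutil.rmtree(tmpdir, ignore_errors=True)
```

The second import sees the old value even though the file changed; only the explicit reload picks up the new one.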


mod_python loading

One reason mod_python is pretty snappy is that apache keeps interpreters loaded in its processes, meaning interpreters are long-lived, serve many separate requests, and have the same import-once-and-ignore-later-imports behaviour.

This is efficient in production since module loading is both lazy and happens at most once (per (sub)interpreter). However, it is a royal pain while developing, as arbitrarily old code may easily apply.


reloading

Now, mod_python does include a basic reloader that checks the change time of handler modules, and reload()s when they are newer/different.

(You can disable this for more basic-python behaviour; see the PythonAutoReload directive, which is On by default. Note that setting this to Off does not affect vampire; you have to use a vampire-specific way of freezing the modules)


Reloading is a handy feature in development, but the fact that it only applies to modules that the dispatcher is directly looking at (apache-setting-based or URL-based, depending on dispatcher) means the following two major annoyances:

  • references to functions in other handlers, and particularly in helper modules, may easily be older versions.
  • module references inside helper modules go out of date

This is because:

  • helper module import is commonly executed at module-global level, so only executed when the module itself is (re)loaded
  • the record of loaded modules is interpreter-global, so if you use basic-python import, this will do nothing (as per basic python loading semantics) even if it is executed

It will even look non-deterministic, because your requests will not necessarily land in the same interpreter; which one you get depends mostly on whether a given one is busy. Reloading quickly may cause apache to send the request to a different interpreter, or not.


The (re)loader for earlier versions of mod_python was only a thin wrapper around basic python loading, meaning a number of problems were inherited from it. Later versions are more complex, but simply cannot work around the fact that reloading is inherently ill-defined.

Problem summary

Different versions of reloading in mod_python have different problems. A summary of the major problems follows:


Persistent/stale global state

When you tell python to reload(module), it does not start anew. It will read the .py file as it normally would, but will do so in the context of the old module. Anything with the same names will be overwritten, but anything not explicitly overwritten (e.g. by initializers at global scope) will just stick around.
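
A small demonstration of names sticking around, as a sketch for Python 3 (importlib.reload() in place of the builtin reload()). The module name staledemo_helper and its variables are invented for the example.

```python
import importlib
import os
import shutil
import sys
import tempfile

def demo_stale_names():
    """Reload a module whose new source no longer defines one of its names."""
    tmpdir = tempfile.mkdtemp()
    source = os.path.join(tmpdir, 'staledemo_helper.py')   # made-up module name
    with open(source, 'w') as f:
        f.write('KEPT = 1\nDROPPED_FROM_SOURCE = 2\n')
    sys.path.insert(0, tmpdir)
    try:
        import staledemo_helper
        # The new source only defines KEPT
        with open(source, 'w') as f:
            f.write('KEPT = 100\n')
        importlib.reload(staledemo_helper)
        # KEPT was overwritten; the other name was never touched by the reload
        return (staledemo_helper.KEPT,
                hasattr(staledemo_helper, 'DROPPED_FROM_SOURCE'))
    finally:
        sys.path.remove(tmpdir)
        sys.modules.pop('staledemo_helper', None)
        shutil.rmtree(tmpdir, ignore_errors=True)
```

The reloaded module still has the name that was deleted from the source, which is exactly the kind of state that can mislead you while developing.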

You can sometimes use this as a feature, for example creating a simple/small/short-term memcache, as long as you don't mind it being per-interpreter.


Note that import_module in mod_python 3.3 clears the module before reloading, to avoid problems stale state may cause.


Stale helper modules

The previous problem implies another. Consider that a handler module will in turn import other modules, probably including some organizational/helper modules you wrote.

Now, automatic handler module reloading isn't too hard, and in fact the default, but global state includes sys.modules, python's record of loaded modules. If you change only a helper module and the handler module only imports/reloads it at global scope, there is no reason for that helper module to get reloaded.

There are some related side effects. For example, python's pretty-printed stack dumps are obviously caused by compiled code, but the source shown in them comes from the .py file on disk. If you don't have up-to-date reloading, that report may be out of sync and look quite confusing.


The simple solution is to touch *.py after every helper module change (and do all imports with reloader code, so they notice that mtime change), since the reloaders watch module mtime. (You may find it handy to write an URL-exposed function that does that touch)

If you want to avoid even doing the touch, you would have to write more involved code that checks on, say, each handler call. (Note this would be somewhat wasteful in disk hits, unless you have some way of disabling it)


Using a reloader that doesn't set sys.modules means the 'loaded modules' state is not interpreter-global state anymore, so if your helper modules import each other, you have to use reloader code in them to make sure such cross-references stay up-to-date. (This complicates command-line use as well)


Name collisions, accidental masking

Plain import, as well as import_module in ≤3.1 which were thin wrappers around the basic python module loading, use sys.modules to keep track of modules.

sys.modules is python's map from module names to loaded modules, which implies there can only be one module with a particular name (in a particular (sub)interpreter). One upside (to writing reloading code) is that reloading automatically applies globally (also meaning you could mix a reloader's import with basic import, because the latter would only bind the name). The downside is that python doesn't treat directories as namespaces, so the name-as-identifier idea becomes a problem in mod_python.

After 3.1, things made more sense, although the semantics (necessarily) changed.



user/system and cross-directory masking

Given a single module search path, path precedence means you may mask a system module with one of your own, or your own may be masked by a system module. In fact, this is two separate problems:


The first is that you don't know which one you will get. Older loaders would use sys.path, newer ones specifically don't and allow or require a path parameter. This:

  • allows you to at least reference the right module,
  • allows code to not have to be in site-packages or your web directory
  • means that you can put modules elsewhere in your filesystem (which can be handy if you have anything security-sensitive, such as hardcoded database logins in a config.py, that you do not want apache to deliver as text when you screw up your apache configuration)


The other problem is that even if you point import paths at the right things, sys.modules will mess things up if there is more than one module with the same name. Consider, for example, that you'll probably have more than one index.py when you have more than one directory.

This applies to any import/reloader that checks and assigns to sys.modules, which includes basic imports (...which you wouldn't use on index, but still) and earlier mod_python reloading (which implicitly imports index).

If you do not explicitly reload, a different module will be masked completely by whichever was the first with that particular name that was loaded. Explicit reloads will mask the other module(s) with the same name arbitrarily (whenever you cause such a reload)
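
The masking by name can be reproduced with plain imports. A sketch for Python 3, where the directory names site_a/site_b and the module name maskdemo_index are invented stand-ins for two web directories that each have their own index.py:

```python
import os
import shutil
import sys
import tempfile

def demo_name_masking():
    """Two directories hold a module of the same name; only the first one loads."""
    root = tempfile.mkdtemp()
    dirs = {}
    for label in ('site_a', 'site_b'):          # made-up directory names
        d = os.path.join(root, label)
        os.mkdir(d)
        with open(os.path.join(d, 'maskdemo_index.py'), 'w') as f:
            f.write('WHERE = %r\n' % label)
        dirs[label] = d
    saved_path = list(sys.path)
    try:
        sys.path.insert(0, dirs['site_a'])
        import maskdemo_index
        first = maskdemo_index.WHERE
        sys.path.insert(0, dirs['site_b'])      # site_b now wins on the search path...
        import maskdemo_index                   # ...but the name is already in sys.modules
        second = maskdemo_index.WHERE
        return first, second
    finally:
        sys.path[:] = saved_path
        sys.modules.pop('maskdemo_index', None)
        shutil.rmtree(root, ignore_errors=True)
```

Both results come from site_a: once the name is in sys.modules, the search path is not consulted again.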

The per-www-directory problem can be solved by using a configuration that forces a subinterpreter per directory, instead of the default of one interpreter per (v)host; see details below, but this does not solve all problems.


Newer reloaders will not assign modules to sys.modules, instead only returning the reference to a new module (and recording a reference in a shadow cache). This avoids directory-duplication trouble, but means that you usually have to have code like:

mypath = calculate_or_hardcode
myhelper      = specificReloader.importFunction('myhelper', mypath) 
myotherhelper = specificReloader.importFunction('myotherhelper', mypath)
# path may be optional

...in every handler module, one for each module you want to ensure is up-to-date, and assist it with a touch *.py whenever necessary.
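
To make the 'shadow cache' idea concrete, here is a minimal sketch of a loader that keys on file path instead of module name and never touches sys.modules. It is written for Python 3 with importlib and is not how vampire or mod_python 3.3 actually implement it; load_from and _loaded are names invented for this example.

```python
import importlib.util
import os

_loaded = {}   # absolute file path -> (mtime in ns, module object)

def load_from(module_name, directory):
    """Load directory/module_name.py, re-executing it only when its mtime changed.

    Hypothetical helper: modules are cached per file path, not per name,
    so same-named modules in different directories do not mask each other.
    """
    filename = os.path.abspath(os.path.join(directory, module_name + '.py'))
    mtime = os.stat(filename).st_mtime_ns
    cached = _loaded.get(filename)
    if cached is not None and cached[0] == mtime:
        return cached[1]                      # unchanged on disk: reuse
    spec = importlib.util.spec_from_file_location(module_name, filename)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)           # runs the source in a fresh module object
    _loaded[filename] = (mtime, module)
    return module
```

Because every (re)load produces a new module object, anything that kept a reference to the old one keeps seeing old code, which is why each module has to fetch its helpers through the loader.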


Can't easily share code between mod_python and command line

The simple solution above told you to do all imports via the mod_python reloaders. The mod_python module is not present outside of mod_python, but without the reloading problems, the lack of that module is only a small hiccup. You probably reference apache inside handlers (e.g. return apache.OK), so could put from mod_python import apache in each handler instead of at global scope.

Note that various of the more modern reloaders can alter the loading behaviour that backs the import command (often optionally, e.g. through VampireImportHooks), meaning you use 'import' and get the loader you want.


If you cannot use that option for some reason, you would e.g. have to have logic that checks whether a given reloader exists and falls back to basic command line importing. (And it is somewhat more verbose if you want that code to distinguish between ImportErrors caused by the absence of mod_python/apache/vampire and ImportErrors caused by errors (e.g. syntax errors) in the helper modules.)


Smaller problems

The 'has the .py file changed?' logic used to be 'is it newer?' , which would cause code restored from backups not to be loaded until you touched it. Since 3.2, any mtime change triggers reload.
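
The difference between the two checks fits in a few lines. This is only an illustration of the comparison as described here, not mod_python's source; both function names are made up.

```python
def changed_if_newer(recorded_mtime, current_mtime):
    """Old-style check: reload only when the file became newer."""
    return current_mtime > recorded_mtime

def changed_if_different(recorded_mtime, current_mtime):
    """Newer-style check: reload on any mtime change, including going back in time."""
    return current_mtime != recorded_mtime
```

A file restored from backup typically has an older mtime than the one recorded at load time, so only the second check notices it.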


In apache.import_module before version 3.3, the default arguments to import_module() meant that the option to auto-reload and to log reloads could not be centrally controlled.


The PythonAutoReload directive was broken in that reloading could not be turned off, between 3.1.3 and 3.2.7(verify) (see also [3]).



Data/module state sharing (MPM details)

See mod_python (mostly #MPM-specific_behaviour) for details on MPMs and subinterpreters, and the effects it has on handlers sharing more or less than you think they would.

Modules themselves also affect this behaviour, when they store global state -- and it affects them when they rely on global state


In threaded MPMs, the module reloading check isn't exclusive, so two threads may reload a module in quick sequence. The actual reloading is thread-safe, but it's unnecessary work. Fixed in 3.2, and fixed slightly better in 3.3.
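
The general way to avoid such duplicate work is to make the check and the reload one exclusive step. The following is a sketch of that idea only; I don't know that this is how mod_python fixed it, and ReloadGuard is an invented name.

```python
import threading

class ReloadGuard:
    """Runs load() at most once per observed mtime, even with many threads."""

    def __init__(self, load):
        self._load = load              # callable that does the actual (re)load
        self._lock = threading.Lock()
        self._seen_mtime = None
        self._value = None
        self.load_count = 0            # only here to make the effect visible

    def get(self, current_mtime):
        with self._lock:               # check and reload happen as one step
            if self._seen_mtime != current_mtime:
                self._value = self._load()
                self._seen_mtime = current_mtime
                self.load_count += 1
            return self._value
```

Without the lock, two threads could both see a changed mtime before either records it, and both would reload.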

Different import systems and behaviour

basic python import

Loads once, see also intro above.

Vampire: vampire.importModule() and VampireImportHooks

For users of vampire: it has its own module importing system, which automatically reloads handlers based on modification time and is apparently smarter about dependencies than the 3.1 apache.import_module() system.

To reload indirect dependencies by code, you can use its importing/reloading function, vampire.importModule('modulename',path), where the path is required.

The 3.3 importer seems to be based on this, though is not compatible with it - you should not use apache.import_module when you use vampire; it seems to lead to occasional exceptions.


If your apache configuration mentions:

PythonOption   VampireImportHooks On

this means the 'import' statement will use the Vampire reloader instead of the basic-python __import__. Note, however, that it does not have the same capabilities; it does not deal well with packages, for example.


If you want an auto-reloading solution, you could do something like:

#at global scope:
def _load():
    global modul1,modul2
    import modul1,modul2

_load() #initial load (which is technically even optional)


def handler_html():
    _load()
    #...rest of handler

This way, you only have to declare the imports once, you can easily call it in handlers, and they get set at global scope.

apache.import_module

apache.import_module('modulename') is an importing function that will import a module, and if something by the name is already loaded, it will check the relevant file and reload it only when its modification time has changed.

It sets an mtime variable that is unknown to import, so mixing import_module() with import won't break, but may cause more reloads than you think.

You can use import for system libraries and site modules, but you should only use apache.import_module('modulenamestring') for your own modules.

You can use the path argument (note: must be a list), which tells the function to load only from this directory - this avoids sys.path elements taking precedence.


in 3.1

This version of import_module was a very thin wrapper around python's own import mechanism. In this version, the function was defined as:

import_module(module_name, autoreload=1, log=0, path=None)

The backing logic seems to be (verify):

  • if the path argument is supplied, look only there
  • else the name will be searched for in sys.path (which can be altered via apache using PythonPath)

Also:

  • adding autoreload=0 means no reloading even if it has changed on disk. The apache-config PythonAutoReload will affect this.
  • adding log=1 will cause printing of debug info to the apache error log. The apache-config PythonDebug will affect this.
  • autoreload and log default to their defined defaults when you use this function explicitly, meaning that PythonAutoReload will only apply to internal reloading.



in 3.2

3.2 is probably best skipped. While it has a number of fixes over 3.1, it still had some bugs that 3.3 does not, so you likely want to upgrade to that in one go.


in 3.3

The function is defined somewhat differently:

import_module(module_name, autoreload=None, log=None, path=None)

The logic seems to be (verify):

  • module_name may now be an absolute or relative(verify) path to a module. If it is, only that is used
  • else if the path argument is supplied, look only there (absolute only?(verify))
  • else check paths in the value set in mod_python.importer.path (can be set via the apache PythonOption directive)
  • else fall back to looking in sys.path (alterable using PythonPath) (is this a basic import or not?(verify))

Also:

  • A module is cleared before it is reloaded - old variables won't clutter up/mislead you
  • the autoreload option now listens to PythonAutoReload even on explicit calls (...since autoreload, and log, now don't have those function-prototype defaults)


Note that import_module will do a disk stat() regardless of autoreload parameter and PythonAutoReload ((verify) whether this is 3.1 only or also true in 3.3). A stat() per module per page load is usually acceptably light (much lighter than reload()ing everything, anyway) and correct behaviour while you're developing.



moving to 3.3

Assuming you've changed all imports of your own modules to import_module() calls, the move to 3.3 mostly involves setting mod_python.importer.path:

PythonOption   mod_python.importer.path "['/var/www/something']"

(double quotes are for safety, to force apache to consider it a single argument)

You can live without mod_python.importer.path if you want to by using import_module() with a path


If you have unported code (or you use vampire's import hooks(verify)) that uses plain import and you want to be lazy, you can set sys.path instead, to have 3.3 mostly act like pre-3.3:

PythonPath sys.path+['/var/www/whatever']

Or via code in your handler modules using

sys.path.insert(0,'/var/www/whatever')

Of course, you're screwing up the import/import_module() separation, so when you break something by doing this it's now squarely your fault:)

Solutions

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

You have to be aware of what each solution actually does, and it's hard to solve quite all the mentioned problems at once (particularly if you want command line importing).


If a module of yours won't change, you can consider moving it outside of the directories in which the reloading applies (it's probably a helper you may want for more projects anyway).


If you have helper modules importing each other, you effectively have a dependency graph, so any solution would have to traverse this, and/or re-bind names on reloads. Traversal is problematic with packages; avoid using them in the context of reloading.


Basic helper module reloading

touch *.py is your friend.

To make mod_python handler modules properly check your helper modules, you should only import them with whatever reloader applies, and touch *.py whenever you change things.

Somewhat more thorough than touch *.py would be a script that contains something like:

find . -name '*.py[co]' -print0 | xargs -0 rm
find . -name '*.py'     -print0 | xargs -0 touch


In each module:

In the case of vampire (these days you should use either mod_python 3.3's reloader, or vampire's, they're cleverer than previous ones), the code to add to each module would look something like:

import os
import vampire
modpath  = os.path.split(__file__)[0] #in same directory (this imitates basepath)
db       = vampire.importModule('db',modpath)
common   = vampire.importModule('common',modpath)

Without vampire, it'd be apache.import_module, and the path would not be required (but useful, particularly in 3.3).


One downside to this approach is that when handlers import each other and the reloader does not assign to sys.modules, module cross-references may be stale, so you would be forced to use the reloader code even in helper modules.

Separating handler modules from helper modules (to be able to test helper modules from the command line interface) would not work, since you would implicitly rely on the mod_python module.


...allowing for command line import

All modules that use the reloader will need this fallback, or use on the command line will fail anyway: even where plain import is used, there is an indirect dependency on apache.

outside_apache=False  
try: 
    import specificReloader
except ImportError:
    outside_apache=True

if outside_apache: # basic imports will work fine (as long as all modules have this fallback)
    import myhelper,myhelper2
else: 
    myhelper  = specificReloader.importFunction('myhelper')
    myhelper2 = specificReloader.importFunction('myhelper2')


This sort of code suggests that it would be nice to have a somewhat more centralized and configurable reloader, to avoid the verbosity as well as the per-module hard dependency on a specific reloader.
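
A minimal sketch of what such a central import point could look like. load_helper is a made-up name, and the choice to fall back to a plain import when mod_python is absent is an assumption about what you want on the command line; only apache.import_module() itself is the real mod_python call.

```python
import importlib

try:
    from mod_python import apache      # only importable inside apache/mod_python
except ImportError:
    apache = None                      # command line, tests, etc.

def load_helper(name, path=None):
    """Hypothetical single place where the choice of (re)loader is made."""
    if apache is None:
        # Plain import semantics; note that path is not honoured here,
        # so the module has to be findable via sys.path.
        return importlib.import_module(name)
    if path is None:
        return apache.import_module(name)
    return apache.import_module(name, path=[path])   # path must be a list
```

Since the try/except only wraps the import of the reloader itself, an ImportError or syntax error inside a helper module is not swallowed and still surfaces as usual.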

PythonInterpPerDirectory

You can solve the cross-directory name duplicate problem by using PythonInterpPerDirectory (see mod_python notes). You will have to be slightly more specific in cross-directory imports and probably use the newer reloader, but this is a good idea anyway as such a directory structure is probably part of your application.

This comes at the cost of

  • some (avoidable, but also small) overhead in the creation of the extra sub-interpreters
  • not being able to share data via interpreter globals across directories (minor detail; you should not rely on that anyway)


Crazy apache solutions

The simplest but rather elephant-gun solution to most of the problems is to restart apache after every code change.

You could also set up a separate development server with MaxRequestsPerChild 1, which implies CGI-like behaviour as it creates and kills an apache child and implied python interpreter for each request. However, this considerably slows down everything apache does. (Also, I think it won't work on threaded MPMs)

These are far from convenient on a server that hosts other (production) vhosts, though you could consider them if you have a personal web server used purely for mod_python development (possibly behind a reverse proxy).


save-and-refresh (for helper modules only)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

It is possible to get save-and-refresh behaviour in various ways -- but note that this is convenience more than real necessity; it does little that touch *.py doesn't do.


The basic idea is that each handler module executes the reloading.

The following example assumes you want to apply this only to explicitly listed helper modules, where you specifically avoided name duplicates:

from mod_python import apache
def reload_volatiles():
    ''' does a reload-if-changed for all explicitly loaded modules '''
    #path=os.path.split(__file__)[0] #in case of vampire
    for module_name in (
        'config',
        'common',
    ):
        apache.import_module( module_name )
        #or  vampire.importModule(module_name, path)


def reloader(f):
    '''
       Decorator to wrap mod_python handlers, 
       to the end of reloading modules before that handler starts. 

       (Note: the way mod_python reacts to differently defined handlers
              (the form-variable feature) seems to make it impossible to
              use the usual *args,**kwargs declaration.)
    '''
    def reload_wrapper(req): 
        apache.import_module('debug').reload_volatiles()
        return f(req)  # call the wrapped handler and pass on its return value
    return reload_wrapper

...stuck in a separate module, say debug.py. In 3.3, you would want to assign the modules to sys.modules; in 3.1, this happened automatically.

And write your handler modules something like:

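from mod_python import apache
import config,common 
#For the references. they stay current if the reloader assigns to sys.modules 
debug = apache.import_module('debug') # the module that holds the decorator above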


@debug.reloader
def handler(req):
    #...handler as usual
    return apache.OK


# Which is roughly equivalent to adding...:
def handler(req):
    apache.import_module('debug').reload_volatiles() # <-- this line
    #...handler as usual
    return apache.OK

# The decorator means all reload-framework-specific code is in one place, 
# easy to change, and useful if you don't want to deal with explicit paths,
# though not necessarily so nice if you do.

The upside is that handlers always use the latest version of helper modules, even without touch *.py.


The downside is that it doesn't deal well with decorators, particularly those that add wrappers to handlers, because those wrapper functions are instantiated and assigned, and re-defining the function they came from doesn't actually change them.

This implies that if you change decorators (such as that reloader), you can't avoid using touch *.py on the handler module(s), so if you use decorators (and frankly, they're really useful to add some features mod_python is missing) you could decide either that this is a necessary evil, or that this is an unnecessarily specific/complex solution.
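
Why wrappers go stale can be shown without any reloader at all. In this sketch the dict named namespace stands in for a module's globals and the second exec stands in for a reload; wrap and helper are made-up names.

```python
def wrap(f):
    """A decorator-style wrapper that captures f at decoration time."""
    def wrapper():
        return f()
    return wrapper

def demo_stale_wrapper():
    namespace = {}                                        # stands in for a module's globals
    exec("def helper():\n    return 'old'\n", namespace)
    wrapped = wrap(namespace['helper'])                   # wrapper now holds the old function
    exec("def helper():\n    return 'new'\n", namespace)  # simulates a reload
    return wrapped(), namespace['helper']()
```

The name helper now points at the new function, but the wrapper created earlier still calls the old one, because it holds a reference to the function object rather than looking the name up again.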



Unsorted

The mod_python module embeds the Python interpreter, so it can only use the python version you compiled it against, and while that makes it independent of the executable/library form of the specific python installation(s) you have installed, it still uses site-packages.


You can see the apache configuration using apache.config_tree(); see http://www.modpython.org/live/current/doc-html/pyapi-apmeth.html


Frameworks you can use in/on mod_python

Vampire is a handler, but also something of a basic platform.

For others, see e.g. http://wiki.python.org/moin/ModPython


Of course, you can use mod_python as a frontend for WSGI and such; read Python notes/Web#WSGI

(Note that while you can use mod_python and mod_wsgi at the same time, mod_python may break things like WSGIPythonHome because of precedence rules(verify).)


Errors

411 Length Required

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

"A request of the requested method POST requires a valid Content-length."

A bug in mod_python?


501 Method Not Implemented

Has several possible causes:

  • if you are using the publisher, this is likely because you have received non-form data. The publisher unconditionally tries to parse POST data as form data, so there seems to be no workaround that allows you to POST other data. See also http://issues.apache.org/jira/browse/MODPYTHON-29


"make_obcallback: could not import mod_python.apache"

Means that while the mod_python apache module is loaded, it can't find the supporting modules. The problem is often specifically that it can't find them in the location where the version of python that mod_python was compiled for expects them to be.

It can have various causes, including not having installed them (when installing from source), or having updated your system packages for a python version beyond the one embedded in mod_python. It's also possible you've properly upgraded mod_python but the old version is still loaded because you have not restarted apache yet.


Summary of request, server, and other object( value)s

Useful when you're looking for that elusive value or function you know must be in there somewhere.

def handler(req):
   import sys, mod_python, mod_python.util, mod_python.apache # submodules are not attributes until imported
   req.content_type='text/plain'

   req.write('Python version: %s\n\n'%str(sys.version_info)) #might be handy to know

   to_detail = (('request object',req),
                ('connection object',req.connection),
                ('server object',req.server),
                ('mod_python module',mod_python),
                ('mod_python.util module',mod_python.util),
                ('mod_python.apache module',mod_python.apache),
                #('mod_python.cache module',mod_python.cache), #not so interesting...
                #('mod_python.importer module',mod_python.importer),
                #('mod_python.publisher module',mod_python.publisher),
               )

   for oname,o in to_detail:
      req.write('  === %s ===\n\n'%oname)
      for name in dir(o):
          if name.startswith('__'): # not interesting. Keeps things that start with a single _
             continue

          req.write('%-30s'%(name+':'))
          req.flush() #we still see the variable name in case of problematic getattr-ing

          if name in ('allowed_methods','allowed_xmethods','boundary','content_languages'):
             req.write('(skipping)\n') #getattr-ing these from req seems to be problematic
             continue

          val = getattr(o,name)
          if callable(val):
             if name.startswith('get_') or name.startswith('is_') or name.startswith('auth_'):
                # these seem to be safe and useful to get
                req.write( 'function, returns %s\n'%str(val()) )
             else:
                # most don't get called, just mentioned as being a function
                req.write('function\n') #mention that they are functions
          else:
             req.write( '%s\n'%str(val) )
      req.write('\n\n')

   return mod_python.apache.OK

I'm playing safe with some functions because I seem to remember some caused side effects. Or excepted, I can't remember now.

See also