Python usage notes/Networking and web

Various things have their own pages, see Category:Python. Some of the pages that collect various practical notes include:

requests

urllib2 is in the standard library, but its API is a little tedious for both basic use and various more custom things (see below)

requests is simpler, and now frequently installed already (verify). For example:

import requests

print requests.get('http://api.ipify.org').text

See:



URL fetching - urllib, urllib2, requests, ...

You may also be interested in:


urllib2

Right now, URL fetching is probably most commonly done with urllib2.

create a Request object (...optionally alter it to change exactly what request to make)
do the actual request on the network
read out the response (often just for its body data)

The simplest version is something like:

import urllib2

req = urllib2.Request('http://www.example.com')
response = urllib2.urlopen(req)    # does the actual network request
data = response.read()             # read out the response body

Comments:

  • on exceptions:
    • The most common exceptions are probably socket.error, urllib2.URLError, and httplib.HTTPException. You can separate them out if you want different handling/logging for each.
    • HTTPError lets you .read() the error page contents.
    • HTTPError is a more specific subclass of URLError - in case you want more detailed cases (e.g. split out timeouts). For error reporting, note that the former has e.code, the latter e.reason.
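
For instance, a sketch of such split-out handling (note that HTTPError is caught before URLError because it is a subclass):

import socket, httplib, urllib2

try:
    response = urllib2.urlopen('http://www.example.com/')
    data = response.read()
except urllib2.HTTPError, e:                      # subclass of URLError, so catch it first
    print 'HTTP error, code %s' % e.code
    errorpage = e.read()                          # HTTPError doubles as a response object
except urllib2.URLError, e:
    print 'Failed to reach server: %s' % e.reason
except (socket.error, httplib.HTTPException), e:
    print 'Networking problem, %s: %s' % (e.__class__, str(e))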


httplib

If you want POSTs with arbitrary data, as some protocols require, you'll notice urllib2 only does POSTs with application/x-www-form-urlencoded.

You'll probably want to use httplib to write slightly lower-level code, for example:

from httplib import HTTPConnection

body = '<xml>arbitrary data</xml>'   # whatever your protocol needs

conn = HTTPConnection('localhost')
conn.request('POST', '/path', body, {'Content-Type': 'text/xml'})
response = conn.getresponse()
print response.status, response.reason
data = response.read()
conn.close()


urllib2 fetcher function

I often use a helper function that makes various common things simpler. At one point it looked like:

import socket, httplib, urllib, urllib2

def fetch(url, data=None, headers=None, raise_as_none=False, return_reqresp=False):
    """ Does a HTTP fetch from an URL.   (Convenience wrapper around urllib2 stuff)
        By default either returns the data at the URL, or raises an error.
 
        data:           May be
                         - a dict               (will be encoded as form values),
                         - a sequence of tuples (will be encoded as form values),
                         - a string  (not altered - you often want to have used urllib.urlencode)
                        When you use this at all, the request becomes a POST instead of the default GET
                           and seems to force  Content-type: application/x-www-form-urlencoded
        headers:        a dict of additional headers.
                          Its values may be a string, or a list/tuple (all will be add_header()'d)
        raise_as_none:  In cases where you want to treat common connection failures as 'try again later',
                          using True here can save a bunch of your own typing in error catching
 
        Returns:
        - if return_reqresp==False (default), the data at an URL in a string
        - if return_reqresp==True,            the request and response objects 
        The latter can be useful for reading from streams, inspecting response headers, and such
    """
    if type(data) in (dict, list, tuple):
        data = urllib.urlencode(data)
    # else assume single string.  TODO: consider unicode
    req = urllib2.Request(url, data=data)
    if headers is not None:
        for name in headers:
            val = headers[name]
            if type(val) in (list, tuple):   # allow multiple values for a header name
                for v in val:                #  (emit as multiple headers)
                    req.add_header(name, v)
            else:
                req.add_header(name, val)
    try:
        response = urllib2.urlopen(req)
    except (socket.error, urllib2.URLError, httplib.HTTPException), e:
        #print 'Networking problem, %s: %s'%(e.__class__, str(e)) # debug
        if raise_as_none:
            return None
        raise
    if return_reqresp:
        return req, response
    return response.read()


# Example:
formdata = ( ('id','fork22'),  ('submit','y') )
headers  = { 'User-Agent':'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' }

htmldata = fetch('http://db.example.com', data=formdata, headers=headers)



CLOSE_WAIT problem

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


It seems that older versions of urllib2 had a bug where it left some uncollectable objects behind (including an open file descriptor), which could cause a few different problems.

See also http://bugs.python.org/issue1208304


Apparently it caused problems on unices up to around Python 2.5, and on Windows up to around Python 3 (verify)

TODO: check history - when was this fixed, and to what degree did this vary with OS?



One result was that a connection would linger in CLOSE_WAIT state (meaning it's closed on the other end, but not by us), which can potentially be a problem for both ends if they are very busy connection-wise and/or these connections may stay around for very long.

When you cannot upgrade past this bug or don't want to assume everyone has new versions, there's a fairly simple code fix too (setting some reference to None when you are done with the handle and will be closing it). I put it in my convenience function.

try: # where handle is the object you get from urllib2's urlopen
    handle.fp._sock.recv = None
except: # in case it's not applicable, ignore this.
    pass


Timeouts

By default, sockets block without a timeout, which may cause processes to hang around too long.


Since Python 2.3, you can set an interpreter-wide default for anything that uses sockets:

import socket

socket.setdefaulttimeout(10)   # seconds; applies to sockets created after this call

The fact that it is interpreter-global is hairy - the last value that was set applies. Threaded programs in particular can become less predictable.

But more to the point, some protocol implementations may rely on this global being sensibly high, so setting it too low can break things.


Since Python 2.6, you have per-socket control, though few higher-level functions expose it.
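
For example (a sketch; urlopen()'s optional timeout parameter was added in 2.6, and plain sockets have settimeout()):

import socket, urllib2

# per-call timeout on urlopen (Python 2.6+):
response = urllib2.urlopen('http://www.example.com/', timeout=5)

# per-socket timeout when working at socket level:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5.0)                  # seconds; may be a float
s.connect(('www.example.com', 80))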

See 'Socket Objects' in the documentation, and perhaps 'Socket Programming HOWTO' for more detail.


DNS lookup

Much of the socket library is specific to IPv4. The exception is socket.getaddrinfo, in that it follows the system configuration.

I've used a convenience function like:

import socket
import urlparse

def dns_lookup(hostname, prefer_family=-1):
    """ Looks up IP for hostname  (probably FQDN)
 
        This also accepts URLs, to save you some coding
        (looks for presence of :// and picks out hostname using urlparse)
 
        Returns an IP in a string,
                or None  in case of failure (currently _any_ exception)
 
        prefer_family can be a socket.AF_* value (e.g. AF_INET for IPv4, AF_INET6 for IPv6), 
                          or -1  for don't care
    """
    try:
        if '://' in hostname:    # is a URL
            hostname = urlparse.urlparse(hostname).netloc
        if '@' in hostname:      # take out  user@  or  user:pass@  if present
            hostname = hostname.rsplit('@', 1)[1]
        ret = None
        for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(hostname, None):
            # Assumptions: first member of sockaddr tuple is address (true for at least IPv4, IPv6)
            #              'prefer' here means "accept any, but break on the first match"
            ret = sockaddr[0]
            if family == prefer_family:
                break
        return ret
    except Exception:
        return None


# examples:
dns_lookup('google.com')
dns_lookup('http://google.com')
 
# ...which on my initial test system returned  IPv4 addresses for both,
# because I don't have IPv6 enabled


Python library details

cgitb

You can hook in the cgitb module to give useful error reports (instead of 'internal server error').

The simplest use in bare-bones CGI is:

import cgitb
cgitb.enable()


In other frameworks it needs a little more work, because you usually don't want it to write to stdout (its default) and because you may wish to catch exceptions in a different way.

You might wrap your real handler call in something like:

import sys
import cgitb

# it may help to force content type to 'text/html' through whatever mechanism you have
try:
    output = real_handler(request)          # your actual handler call
except Exception:
    output = cgitb.html( sys.exc_info() )   # the interesting line


Forms, File uploads

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Parsing an URL, resolving relative URLs

The urlparse module (urllib.parse in py3k) splits a URL into its various parts, can join them again, resolve relative URLs, and such.

urlparse() splits a URL into a tuple: (scheme, netloc, path, params, query, fragment). The object it returns is actually a subclass of tuple that also contains the same information as members.
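
For example:

from urlparse import urlparse

p = urlparse('http://www.example.com/foo;param?q=1#frag')
print p                  # ('http', 'www.example.com', '/foo', 'param', 'q=1', 'frag')
print p.netloc, p.query  # www.example.com q=1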


urljoin(context_url, path) also deserves mention, being useful to take links that may be relative or absolute paths, or entire URLs, and resolve them into full URLs (in the context of the page's URL). The behaviour is basically what you would expect:

from urlparse import urljoin

urljoin('http://www.example.com',  '')  == 'http://www.example.com'
urljoin('http://www.example.com/', '')  == 'http://www.example.com/'
urljoin('http://www.example.com',  '/') == 'http://www.example.com/'
urljoin('http://www.example.com/', '/') == 'http://www.example.com/'
urljoin('http://www.example.com',  '/bar') == 'http://www.example.com/bar'
urljoin('http://www.example.com/', 'bar')  == 'http://www.example.com/bar'
urljoin('http://www.example.com',  'bar')  == 'http://www.example.com/bar'

# starting from page:
urljoin('http://www.example.com/foo',  'bar')  == 'http://www.example.com/bar'
urljoin('http://www.example.com/foo',  '/bar') == 'http://www.example.com/bar'

# starting from directory:
urljoin('http://www.example.com/foo/', '/bar')    == 'http://www.example.com/bar'
urljoin('http://www.example.com/foo/', 'bar')     == 'http://www.example.com/foo/bar'
urljoin('http://www.example.com/foo/', 'bar/')    == 'http://www.example.com/foo/bar/'
urljoin('http://www.example.com/foo/', 'bar/boo') == 'http://www.example.com/foo/bar/boo'

# absolute:
urljoin('http://www.example.com/foo/', 'http://elsewhere.com') == 'http://elsewhere.com'

WSGI

See CGI,_FastCGI,_SCGI,_WSGI,_servlets_and_such#WSGI, mod_wsgi notes and some notes elsewhere (e.g. those on the CherryPy page)

Servers

Note that this section describes things that either are servers, or have their own. Frameworks focus more on the logic on top of a server, though since some depend tightly on a particular server, the split is not always clean.

CherryPy

Various WSGI

See CGI,_FastCGI,_SCGI,_WSGI,_servlets_and_such#WSGI

Twisted Web

  • http://twistedmatrix.com/projects/web/ (Twisted/Web)
  • Twisted is a higher-level networking toolkit, but it also has an HTTP server - though I'm guessing you still need to work in a Twisted sort of model:
  • Provides concurrency without threading: it is event-based and requires any potentially blocking things to be deferred (verify)
  • Looks like it'll work under most OSes, with a manageable extra dependency here and there.


fapws3

http://www.fapws.org/

mod_python

  • now a dead project, no longer actively developed
  • can be quite fast, as persistent interpreters handle many requests over time
  • a few different handlers, including PSP templating (note that Vampire is a useful addition to these)
  • has some peculiarities you have to work around

BaseHTTPServer and friends

  • In the standard python library, so always available
  • Consists of quite basic handling for HTTP requests.
  • You have to decide what to do based on incoming URLs yourself (no 'run script if there, fall back to file serving' logic at all)
  • A single-purpose server like this can be written in a dozen or two lines, see the example.
  • Not threaded or asynchronous: things are served in sequence, so if one handler is heavy, it will block the next requests.


There are also SimpleHTTPServer and CGIHTTPServer, building on (and interface-compatible with) the BaseHTTPServer:

  • Both in the standard library
  • Simple... serves files based on a simple URL-path-to-directory-path map.
  • CGI... adds the ability to run external scripts for output (except on MacOS - pre-X or in general? (verify)). It sends a 200 status, then hands further output generation to the executable (so it doesn't serve all needs).
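
For example, serving the current directory takes only a few lines (a minimal sketch):

import SocketServer
import SimpleHTTPServer

# serve files from the current directory on port 8000
httpd = SocketServer.TCPServer(('', 8000), SimpleHTTPServer.SimpleHTTPRequestHandler)
httpd.serve_forever()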


You can make the servers above thread or fork by using the two SocketServer mix-ins (ThreadingMixIn and ForkingMixIn). These make the TCPServer (that the HTTP servers are based on) wrap each request in a function that creates a thread / forked process, which then calls the actual handler.

Minimal code for the basic, threaded and forked versions:

import threading       # only imported for the thread count below
import cgi
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn, ForkingMixIn

class ThreadingBaseHTTPServer(ThreadingMixIn, HTTPServer):
    """ Threaded variant of the basic HTTP server """

class ForkingBaseHTTPServer(ForkingMixIn, HTTPServer):
    """ Not on windows, of course, since it doesn't do forking. """

class ExampleHandlerSet(BaseHTTPRequestHandler):
    """ Hello-world type handler """
    def do_GET(self):
        # the ab test below requests e.g. /name=Booga, so treat the path as a query string
        args = cgi.parse_qs(self.path.lstrip('/'), True)
        if 'name' in args:
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write("Hello WWW, hey %s" % args['name'][0])
            self.wfile.write("  (%d threads)" % threading.activeCount())
        else:
            self.send_error(404, "No name given.")

# Choose one:
#srv = HTTPServer(('0.0.0.0',8765), ExampleHandlerSet)
srv = ThreadingBaseHTTPServer(('0.0.0.0',8765), ExampleHandlerSet)
#srv = ForkingBaseHTTPServer(('0.0.0.0',8765), ExampleHandlerSet)

srv.serve_forever()   # there are alternative ways of starting, this one's the shortest to type

A very basic handler like this is pretty fast.
ab2 -n 1000 -c 1 http://localhost:8765/name=Booga
shows that 99% of requests are served within 1ms, or 6ms when causing the 404. I'm not sure why the other 1-2% took up to 100-300ms. Perhaps some occasional cleanup.


About the threaded versions:

  • This is taken from an example apparently quoted from the book "Python Web Programming," along with the note that the MixIn class should be the first in that inheritance list.
  • As always with threading, watch your race conditions and all that mess.
  • This sort of concurrency isn't geared for speed. The threads aren't as lightweight as they could be, nor the cheapest solution in the first place.
  • You can't control the number of threads (without implementing thread pooling), and various OSes start having problems above a few thousand threads, so this won't scale to very heavy load.
  • On windows (Python 2.5) it started refusing connections when I ran ab2 with a concurrency of 6 or higher. I'm not sure why (could be windows being pissy, perhaps the firewall)


Something based on asyncore will be more efficient and still based fairly simply on python internals. Medusa does this.
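
To illustrate the asyncore style (a minimal echo server rather than HTTP; the class names here are made up for the example):

import asyncore, socket

class EchoHandler(asyncore.dispatcher_with_send):
    def handle_read(self):
        data = self.recv(8192)   # only called when data is ready, so this won't block
        if data:
            self.send(data)

class EchoServer(asyncore.dispatcher):
    def __init__(self, host, port):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.set_reuse_addr()
        self.bind((host, port))
        self.listen(5)

    def handle_accept(self):
        pair = self.accept()
        if pair is not None:
            sock, addr = pair
            EchoHandler(sock)

EchoServer('0.0.0.0', 8007)
asyncore.loop()                  # single-threaded event loop dispatching all sockets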

Medusa

Frameworks

This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software).

The more complex frameworks often employ a (framework-wise) MVC model in that their major components are usually:

  • an Object-Relational Mapper,
  • a templating framework, and
  • their own URL dispatching and/or webserver

...and often various tools.


Turbogears

For a good part just a nice combination of things already out there, with a set of useful scripts.


Django

  • http://www.djangoproject.com/
  • Somewhat CMS-oriented
  • Has major levels of server-side framework in MVC-like arrangement, like TurboGears, Rails, etc.
  • ORM based database access
  • Has a pure-python development server, though for production WSGI or FastCGI is suggested

Bottle

Zope

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)




See also:

Pylons

Nevow

  • http://divmod.org/trac/
  • Builds on Twisted (...Web) and provides a web-app framework including templating, a JavaScript bridge, and such things.
  • more specific than most of the others mentioned here, but powerful in its way


Karrigell


Quixote

  • http://www.mems-exchange.org/software/quixote/
  • geared to integrate relatively easily with basic Python (though that's true for all, just to different degrees)
  • Fairly general; apparently works as/under plain CGI, FastCGI, SCGI, mod_python, Twisted, Medusa.


Porcupine

  • http://www.innoscript.org/
  • Geared to web GUIs
  • Business/Binding/Presentation layers model (but the latter are optional)
  • Templating: QuiX (geared to interface design), PSP or servlets


Webware


simpleweb

Templating

This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software).

Cheetah


Genshi


Jinja


Spyce