Python usage notes/Networking and web

URL fetching


urllib2 and httplib are in the standard library, but their APIs are a little tedious, both for basic things and for various custom stuff (see below).

requests is a library you have to install, but it makes life so much simpler that you may want to.

Simple examples:

import requests
# simple text get
r = requests.get('')
print r.text # decoded according to r.encoding
# PUT with form data and a custom header
r = requests.put('',
                 data={'key':'value'},
                 headers={'user-agent':'my-app/0.0.1'})
print r.status_code
print r.content # bytes
print r.headers
print r.cookies

It also has a nice interface to many of the less usual things you may occasionally need, like OAuth, certificates, streaming, multiple file uploads. Also timeouts are a bit easier, as they should be.



With only the standard library, URL fetching is probably most commonly done with urllib2. The steps are:

  • create a Request object (...optionally alter it to change exactly what request to make)
  • do the actual request on the network
  • read out the response (often just for its body data)

The simplest version is something like:

import urllib2
req  = urllib2.Request(url)
response = urllib2.urlopen(req)
bodydata = response.read()


  • on exceptions:
    • The most common exceptions are probably socket.error, urllib2.URLError, and httplib.HTTPException. You can separate them out if you want different handling/logging for each.
    • HTTPError lets you .read() the error page contents.
    • HTTPError is a more specific subclass of URLError, so catch it first if you want more detailed cases (e.g. to split out HTTP status errors from timeouts and connection failures). For error reporting, note that HTTPError has e.code, while a plain URLError has e.reason.
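
The exception ordering described above can be sketched as follows. This is a Python 3 sketch (there, urllib2 became urllib.request and urllib.error, and httplib became http.client), not the Python 2 code used elsewhere on this page. The function name fetch_or_describe and its return convention are made up for illustration.

```python
import http.client
import socket
import urllib.error
import urllib.request

def fetch_or_describe(url, timeout=10):
    """Return (body, None) on success, or (None, description) on failure.
    Hypothetical helper, only here to show which exception carries what."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:
        # Has to be caught before URLError, because it is a subclass of it.
        # e.code is the HTTP status; e.read() gives the error page's body.
        return None, 'HTTP status %d (error page: %d bytes)' % (e.code, len(e.read()))
    except urllib.error.URLError as e:
        # No HTTP response at all: DNS failure, refused connection, bad scheme...
        return None, 'URL error: %s' % (e.reason,)
    except (socket.error, http.client.HTTPException) as e:
        # Lower-level problems that were not wrapped in a URLError
        return None, '%s: %s' % (e.__class__.__name__, e)
```

Returning a description instead of raising is only one option; in real code you may prefer to log and re-raise.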

urllib2 fetcher function

Note: If you don't mind installing extra modules, there are now much better options, e.g. requests.

When tied to the standard library, I often use a helper function that makes more basic fetches simpler. At one point it looked like:

import socket, httplib, urllib, urllib2
def urlfetch(url, data=None, headers=None, raise_as_none=False, return_reqresp=False):
    """ Does a HTTP fetch from an URL.   (Convenience wrapper around urllib2 stuff)
        By default either returns the data at the URL, or raises an error.
        data:          May be
                        - a dict               (will be encoded as form values),
                        - a sequence of tuples (will be encoded as form values),
                        - a string  (not altered - you often want to have used urllib.urlencode)
                        When you use this at all, the request becomes a POST instead of the default GET
                           and seems to force  Content-type: application/x-www-form-urlencoded
        headers:        a dict of additional headers.
                          Its values may be a string, or a list/tuple (all will be add_header()'d)
        raise_as_none:  In cases where you want to treat common connection failures as 'try again later',
                          using True here can save a bunch of your own typing in error catching
        Returns:
        - if return_reqresp==False (default), the data at an URL in a string
        - if return_reqresp==True,            the request and response objects 
        The latter can be useful reading from streams, inspecting response headers, and such
    """
    try:
        if type(data) in (tuple, dict):
            data = urllib.urlencode(data)
        req  = urllib2.Request(url, data=data)
        if headers!=None:
            for k in headers:
                vv = headers[k]
                if type(vv) in (list,tuple): # allow multiple values for a header name
                    for v in vv:             #  (emit as multiple headers)
                        req.add_header(k, v) #  note: add_header() replaces an earlier value for the same name
                else: # assume single string.  TODO: consider unicode
                    req.add_header(k, vv)
        response = urllib2.urlopen(req)
        if return_reqresp:
            return req,response
        else:
            data = response.read()
            response.close()
            return data
    except (socket.error, urllib2.URLError, httplib.HTTPException), e:
        #print 'Networking problem, %s: %s'%(e.__class__, str(e)) # debug
        if raise_as_none:
            return None
        else:
            raise

# Example:
formdata    = ( ('id','fork22'),  ('submit','y') )
headerdict  = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
htmldata = urlfetch('', data=formdata, headers=headerdict)

CLOSE_WAIT problem
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

It seems that older versions of urllib2 (apparently in unices up to py2.5ish, windows up to ~py3(verify)) had a bug where it left some uncollectable objects behind (including an open file descriptor), which could cause a few different problems.

One result was that a connection would linger in CLOSE_WAIT state (meaning it's closed on the other end, but not by us), which can potentially be a problem for both ends if they are very busy connection-wise and/or these connections may stay around for very long.

When you cannot upgrade past this bug or don't want to assume everyone has new versions, there's a fairly simple code fix too (setting some reference to None when you are done with the handle and will be closing it). I put it in my convenience function.

try: # where handle is the object you get from urllib2's urlopen
    handle.fp._sock.recv = None
except: # in case it's not applicable, ignore this.
    pass

TODO: check history - when was this fixed, and to what degree did this vary with OS?


If you want POSTs with arbitrary data, or other methods, as some protocols require, you'll notice urllib2 only does POSTs with application/x-www-form-urlencoded.

You'll probably want to use httplib to write slightly lower-level code, for example:

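from httplib import HTTPConnection
# post_data is the already-encoded body, headers a dict (including the Content-Type you want)
conn = HTTPConnection('localhost', 80)
conn.request('POST','/path', post_data, headers)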
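
As a slightly fuller illustration of the same idea, here is a Python 3 sketch (httplib is called http.client there). The helper name send_body and its arguments are hypothetical, not part of any library.

```python
import http.client

def send_body(host, port, method, path, body, content_type, timeout=10):
    """Send `body` (bytes) with any HTTP method and any Content-Type.
    Returns (status, response_body). Hypothetical helper name."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(method, path, body=body,
                     headers={'Content-Type': content_type})
        resp = conn.getresponse()
        return resp.status, resp.read()  # read before the connection is closed
    finally:
        conn.close()
```

For example, send_body('localhost', 80, 'PUT', '/path', b'<doc/>', 'application/xml') would send an XML document with PUT, assuming something is listening there.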

httplib fetcher function

A convenience function similar to the one above, to make life simpler.

import base64, httplib, urlparse
def httplib_request(url, username=None, password=None, data=None, headers=None, method='GET'):
    ' headers should be a mapping, data already encoded'
    headers = dict(headers or {}) # copy, so we never alter the caller's mapping
    uo = urlparse.urlsplit(url)
    host = uo.netloc # includes port, if any
    port = None
    if ':' in host:
        host,port = host.split(':',1)
        port = int(port)
    if username and password: # Basic HTTP auth
        base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
        headers['Authorization'] = "Basic %s" % base64string
    if uo.scheme.lower() in ('http',):
        conn = httplib.HTTPConnection(host, port)
    elif uo.scheme.lower() in ('https',):
        conn = httplib.HTTPSConnection(host, port)
    else:
        raise ValueError("Don't understand URL scheme %r in %r"%(uo.scheme, url))
    path = uo.path
    if uo.query: 
        path+='?'+uo.query # note that this doesn't cover all uses
    conn.request(method, path, data, headers)
    resp = conn.getresponse()
    data = resp.read() # close() can cut off before we read data, so we read it here.
    conn.close()
    return resp, data
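
The URL-splitting and Basic-auth parts of the function above can also lean on what the URL parser already provides. A Python 3 sketch (urlparse is urllib.parse there); the names split_for_connection and basic_auth_header are made up for this example:

```python
import base64
from urllib.parse import urlsplit

def split_for_connection(url):
    """Return (scheme, host, port, path_and_query) for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ('http', 'https'):
        raise ValueError("Don't understand URL scheme %r in %r" % (parts.scheme, url))
    # .hostname drops user:pass@ and the port, and unwraps [IPv6] brackets;
    # .port is an int, or None when the URL does not mention one
    port = parts.port or (443 if scheme == 'https' else 80)
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return scheme, parts.hostname, port, path

def basic_auth_header(username, password):
    """Value for an Authorization header using Basic HTTP auth."""
    token = base64.b64encode(('%s:%s' % (username, password)).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')
```

Using .hostname and .port avoids splitting on ':' by hand, which goes wrong for IPv6 literals and for URLs that carry user:pass@.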


By default, sockets block without a timeout, which may cause processes to hang around too long.

Since Python 2.3, you can set an interpreter-wide default for anything that uses sockets:

import socket
socket.setdefaulttimeout(30) # in seconds; the value here is only an example

The fact that it is interpreter-global is hairy: the last value that was set applies. Threaded programs in particular can become less predictable.

But more to the point, some protocol implementations may rely on this global being sensibly high, so setting it too low can break things.

Since Python 2.6, you have per-connection control in more places (e.g. the timeout argument to urllib2.urlopen and httplib.HTTPConnection), though few higher-level functions expose it.

See 'Socket Objects' in the documentation, and perhaps 'Socket Programming HOWTO' for more detail.
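
A small sketch of the difference between the global default and a per-socket timeout. It is written for Python 3 but uses only calls that also exist in Python 2; the values 30 and 2.5 are arbitrary.

```python
import socket

# Interpreter-wide default: applies to sockets created after this call
previous = socket.getdefaulttimeout()   # None means 'block without timeout'
socket.setdefaulttimeout(30)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
inherited = s.gettimeout()              # picked up the global default

# Per-socket override, which does not touch the global default
s.settimeout(2.5)
overridden = s.gettimeout()
s.close()

socket.setdefaulttimeout(previous)      # restore, to be polite to other code
```

When you open connections yourself, socket.create_connection(address, timeout) sets the timeout on just that one socket.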

DNS lookup

Much of the socket library is specific to IPv4. The exception is socket.getaddrinfo, in that it follows the system configuration.

I've used a convenience function like:

import socket,urlparse
def dns_lookup(hostname, prefer_family=socket.AF_INET):
    """ Looks up IP for hostname  (probably FQDN)
        This also accepts URLs, to save you some coding
        (looks for presence of :// and picks out hostname using urlparse)
        Returns an IP in a string,
                or None  in case of failure (currently _any_ exception)
        prefer_family can be a socket.AF_* value (e.g. AF_INET for IPv4, AF_INET6 for IPv6), 
                          or -1  for don't care
    """
    if '://' in hostname: # is a URL
        hostname = urlparse.urlparse(hostname).netloc
        if '@' in hostname:  # take out  user@  or  user:pass@  if present
            hostname = hostname[hostname.index('@')+1:]
    try:
        retval = None
        for entry in socket.getaddrinfo(hostname,0):
            (family, socktype, proto, canonname, sockaddr) = entry
            # Assumptions: first member of sockaddr tuple is address (true for at least IPv4, IPv6)
            #              'prefer' here means "accept any, but break on the first match"
            retval = sockaddr[0]
            if prefer_family==-1  or  prefer_family==family:
                break
        return retval
    except Exception, e:
        print "Exception",e
        return None

# examples:
dns_lookup('', prefer_family=socket.AF_INET6)
# ...which on my initial test system returned  IPv4 addresses for both,
# because I don't have IPv6 enabled
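
A Python 3 variant of the same idea, as a sketch rather than a drop-in replacement. It differs in two ways that are my own choices: it lets urlsplit strip user:pass@ and the port, and when no entry matches the preferred family it falls back to the first entry rather than the last.

```python
import socket
from urllib.parse import urlsplit

def dns_lookup(hostname, prefer_family=socket.AF_INET):
    """Return one IP address (as a string) for a hostname or URL, or None."""
    if '://' in hostname:                       # looks like a URL
        hostname = urlsplit(hostname).hostname  # drops user:pass@ and :port
    if not hostname:
        return None
    try:
        entries = socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        return None
    fallback = None
    for family, socktype, proto, canonname, sockaddr in entries:
        if prefer_family in (-1, family):
            return sockaddr[0]                  # first entry of the wanted family
        if fallback is None:
            fallback = sockaddr[0]              # remember the first of any family
    return fallback
```

Numeric addresses pass through unchanged, so dns_lookup('127.0.0.1') works without any DNS server being involved.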

Getting MAC address

  • uuid.getnode()
    • returns the MAC as a 48-bit int, except when it fails to find one, in which case it returns a random 48-bit number
    • also not necessarily predictable with more than one network controller (e.g. LAN + wifi)

  • ioctl on a UDP socket
    • not really cross-platform
    • can ask for a specific interface

  • netifaces module
    • not standard library
  • getmac
    • not standard library
  • psutil
    • not standard library

  • /sys/class/net/interface/address (Linux)
  • ifconfig

  • wmi module (Windows)
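
For the uuid.getnode() option, a short sketch of turning the integer into the usual text form. The helper names are made up. The multicast-bit check relies on the documented behaviour that the random fallback has that bit set; treat it as a hint, not proof.

```python
import uuid

def format_mac(node):
    """Format a 48-bit integer as aa:bb:cc:dd:ee:ff."""
    return ':'.join('%02x' % ((node >> shift) & 0xff)
                    for shift in range(40, -1, -8))

def looks_random(node):
    """True if the multicast bit (lowest bit of the first octet) is set,
    which is how uuid.getnode() marks its random fallback value."""
    return bool((node >> 40) & 1)

node = uuid.getnode()
mac_text = format_mac(node)
```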


Python library details


The cgitb module gives tracebacks for basic CGI, for more useful error reports than 'internal server error'.

The simplest use in bare-bones CGI is:

import cgitb
cgitb.enable()

In other frameworks it needs a little more work, because the above assumes CGI in the style of writing to stdout.

Also, you often want a little more control over the exception catching.

You might wrap your real handler call in something like:

try:
    return realhandler()
except:
    # it may help to force content type to 'text/html' through whatever mechanism you have
    cgitb.Hook(file=output_file_object).handle() #the interesting line

Forms, File uploads

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Parsing an URL, resolving relative URLs

The urlparse module (urllib.parse in py3k) splits a URL into its various parts, can join them again, resolve relative URLs, and such.

urlparse() splits a URL into a tuple: (scheme, netloc, path, params, query, fragment). The object it returns is actually a subclass of tuple that also contains the same information as members.
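
A short Python 3 illustration of both access styles (the module is urllib.parse there). The URL is a made-up example.

```python
from urllib.parse import urlparse

parts = urlparse('http://user:pw@example.com:8080/dir/page.html;p=1?q=search#top')

# Tuple-style unpacking...
scheme, netloc, path, params, query, fragment = parts

# ...and the same values as attributes, plus a few derived ones
host = parts.hostname   # netloc without user:pw@ and without the port
port = parts.port       # an int, or None if the URL has no port
user = parts.username
```

Here netloc is 'user:pw@example.com:8080', while hostname and port give 'example.com' and 8080.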

urljoin( context_url, path) also deserves mention, being useful to take links that may be relative or absolute paths, or entire URLs, and resolve them into full URLs (in the context of the page's URL). The behaviour is basically what you would expect:

assert urljoin('',  '')  == ''
assert urljoin('', '')  == ''
assert urljoin('',  '/') == ''
assert urljoin('', '/') == ''
assert urljoin('',      '/bar') == ''
assert urljoin('',     'bar')  == ''
assert urljoin('',      'bar')  == ''
# starting from page
assert urljoin('',  'bar')  == ''                
assert urljoin('',  '/bar') == ''
# starting from directory:
assert urljoin('', '/bar') == ''                
assert urljoin('',   'bar')      == ''
assert urljoin('',   'bar/')     == ''
assert urljoin('',   'bar/boo')  == ''
# absolute:
assert urljoin('',   '')  == ''
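
A few concrete cases, as a Python 3 sketch with made-up example.com URLs (in Python 2, import urljoin from urlparse instead):

```python
from urllib.parse import urljoin

page = 'http://example.com/foo/page.html'
directory = 'http://example.com/foo/'

# starting from a page: relative paths replace the last path segment
assert urljoin(page, 'bar') == 'http://example.com/foo/bar'
assert urljoin(page, '/bar') == 'http://example.com/bar'

# starting from a directory (note the trailing slash)
assert urljoin(directory, 'bar') == 'http://example.com/foo/bar'
assert urljoin(directory, 'bar/') == 'http://example.com/foo/bar/'
assert urljoin(directory, 'bar/boo') == 'http://example.com/foo/bar/boo'
assert urljoin(directory, '../bar') == 'http://example.com/bar'

# without the trailing slash, 'foo' counts as a page, not a directory
assert urljoin('http://example.com/foo', 'bar') == 'http://example.com/bar'

# an absolute URL simply wins
assert urljoin(page, 'http://other.example.org/x') == 'http://other.example.org/x'
```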


See the WSGI section of the notes on CGI, FastCGI, SCGI, WSGI, servlets and such, the mod_wsgi notes, and some notes elsewhere (e.g. those on the CherryPy page).


Note that this section describes things that either are servers or come with their own. Frameworks focus more on the logic on top of them, though they sometimes depend tightly on one server or another, so the two categories overlap.


Various WSGI

See the WSGI section of the notes on CGI, FastCGI, SCGI, WSGI, servlets and such.

Twisted Web

  • (Twisted/Web)
  • Twisted is a library for higher-level networking, but also has an HTTP server, though I'm guessing you still need to work in a Twisted sort of model:
  • Provides concurrency without threading: it is event-based and requires any potentially blocking things to be deferred (verify)
  • Looks like it'll work under most OSes, with a manageable extra dependency here and there.



  • no longer actively developed, effectively a dead project
  • can be quite fast as persistent interpreters handle many requests over time
  • A few different handlers, including PSP templating (note that Vampire is a useful addition to these)
  • has some peculiarities you have to work around

BaseHTTPServer and friends

  • In the standard python library, so always available
  • Consists of quite basic handling for HTTP requests.
  • You have to decide yourself what to do based on incoming URLs (there is no 'run script if there, fall back to file serving' logic at all)
  • A single-purpose server like this can be written in a dozen or two lines, see the example.
  • Neither threaded nor asynchronous: things are served in sequence, so if one handler is heavy, it will block the next requests.

There are also SimpleHTTPServer and CGIHTTPServer, building on (and interface-compatible with) the BaseHTTPServer:

  • Both in the standard library
  • Simple... serves files based on a simple URL-path-to-directory-path map.
  • CGI... adds the ability to run external scripts for output. (except on MacOS (pre-X or general?(verify))). It sends a 200 status, then hands output to the executable (so it doesn't serve all needs).

You can make servers based on the above use threads or forked processes using the two SocketServer MixIns. These seem to make the TCPServer (that the HTTPServers are based on) wrap each request in a function that creates a thread or forked process, which then calls the actual handler.

Minimal code for the basic, threaded and forked versions:

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn, ForkingMixIn
import cgi

class ThreadingBaseHTTPServer(ThreadingMixIn, HTTPServer):
    pass

class ForkingBaseHTTPServer(ForkingMixIn, HTTPServer):
    " Not on windows, of course, since it doesn't do forking. "

class ExampleHandlerSet(BaseHTTPRequestHandler):
    """ Hello-world type handler """
    def do_GET(self):
       import threading # only imported for the print below
       vars = cgi.parse_qs( self.path[1:],keep_blank_values=True )
       if 'name' in vars:
           name = vars['name'][0] # parse_qs gives a list of values per name
           self.send_response(200)
           self.send_header('Content-type','text/plain')
           self.end_headers()
           self.wfile.write( "Hello WWW, hey %s"%name )
           print "%d threads"%threading.activeCount()
       else:
           self.send_error(404,"No name given.")

# Choose one:
#srv = HTTPServer(('',8765),ExampleHandlerSet)
srv = ThreadingBaseHTTPServer(('',8765),ExampleHandlerSet)
#srv = ForkingBaseHTTPServer(('',8765),ExampleHandlerSet)
srv.serve_forever() #there are alternative ways of starting, this one's the shortest to type

A very basic handler like this is pretty fast. Running

ab2 -n 1000 -c 1 http://localhost:8765/name=Booga

shows that 99% of requests are served within 1ms, or 6ms when causing the 404. I'm not sure why the other 1-2% took up to 100-300ms. Perhaps some occasional cleanup.
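
For reference, roughly the same threaded server in Python 3, where BaseHTTPServer became http.server and SocketServer became socketserver. This is a sketch: the names ThreadedHTTPServer, HelloHandler and start_server are made up, and it binds to 127.0.0.1 only.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from socketserver import ThreadingMixIn
from urllib.parse import parse_qs

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    daemon_threads = True  # don't let handler threads keep the process alive

class HelloHandler(BaseHTTPRequestHandler):
    """ Hello-world type handler """
    def do_GET(self):
        fields = parse_qs(self.path[1:], keep_blank_values=True)
        if 'name' in fields:
            body = ('Hello WWW, hey %s' % fields['name'][0]).encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, 'No name given.')

    def log_message(self, format, *args):
        pass  # silence the default per-request logging to stderr

def start_server(port=0):
    """Start serving in a background thread; port 0 lets the OS pick one."""
    srv = ThreadedHTTPServer(('127.0.0.1', port), HelloHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

start_server(8765) gives a server you can point ab2 at as above; call shutdown() and server_close() on the returned object to stop it.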

About the threaded versions:

  • This is taken from an example apparently quoted from the book "Python Web Programming", along with the note that the MixIn class should be the first in that inheritance list.
  • As always with threading, watch your race conditions and all that mess.
  • This sort of concurrency isn't geared for speed. The threads aren't as lightweight as they could be, nor are they the cheapest solution in the first place.
  • You can't control the number of threads (without implementing thread pooling) and various OSes start having problems above a few thousand threads, so this won't scale to very heavy load.
  • On Windows (Python 2.5) it started refusing connections when I ran ab2 with a concurrency of 6 or higher. I'm not sure why (could be Windows being difficult, perhaps the firewall).

Something based on asyncore will be more efficient and still based fairly simply on python internals. Medusa does this.



This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software).

The more complex frameworks often employ a (framework-wise) MVC model in that their major components are usually:

  • an Object-Relational Mapper,
  • a templating framework, and
  • their own URL dispatching and/or webserver

...and often various tools.


For a good part just a nice combination of things out there with a set of useful scripts


  • Somewhat CMS-oriented
  • Has major levels of server-side framework in MVC-like arrangement, like TurboGears, Rails, etc.
  • ORM based database access
  • Has a pure-python development server, but production suggested to be WSGI or FastCGI





  • Builds on Twisted (...web) and provides a web-app framework including templating, a javascript bridge and such things.
  • more specific than most of the others mentioned here, but powerful in its way



  • geared to integrate relatively easily with basic Python (though that's true for all, just to different degrees)
  • Fairly general; apparently works as/under plain CGI, FastCGI, SCGI, mod_python, Twisted, Medusa.


  • Geared to web GUIs
  • Business/Binding/Presentation layers model (but the latter are optional)
  • Templating: QuiX (geared to interface design), PSP or servlets



