Python database notes


📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

SQL

DB-API

If you need access to an existing relational database, it's easiest to use a module for that database that implements the Python database API (PEP249, DB-API 2). (See some useful notes [1] and some examples [2])

parameters, paramstyle

These packages will allow parametric construction of SQL, so that you never have to be the reason for SQL injection attacks.

You can write something along the lines of:

ti=" 'Tonsils!' "
te=' "Fish!" '
curs.execute("INSERT INTO tabby (title,text) values (%s,%s)", (ti,te))


paramstyle mostly means "the way you mark 'value goes here', and the way the library picks it up".

It's a bit of a pity there isn't just one very standard one (or one standard tuple one and one standard dict one), but it's not much extra work to switch styles, so hey.

The different paramstyles are:

  • 'qmark' - Question mark style
    • e.g. ...WHERE name=?, and hand in a tuple
  • 'numeric' - Numeric, positional style
    • e.g. ...WHERE name=:1, and hand in a tuple
  • 'named' - Named style
    • e.g. ...WHERE name=:name, and hand in a dict
  • 'format' - printf-style format code (only ever using %s is fine, though)
    • e.g. ...WHERE name=%s, and hand in a tuple
  • 'pyformat' - Python extended format codes
    • e.g. ...WHERE name=%(name)s, and hand in a dict


Notes:

  • Each library may have its own default paramstyle (verify), though it seems at least MySQLdb and psycopg2 default to 'format'.




When you hand in data in this way (tuple or dict, depending on paramstyle, see below), common types are converted by the library. More complex types (such as timestamps) may not all be automatically escaped by all libraries (the PEP doesn't require it), so you may need to explicitly use (library-specific) wrapper objects for conversion.


Don't confuse the %s above with the printf-like ("%s"%(ti)) way of doing this - these %s are merely markers, and only one out of five alternatives.

In fact, each dbapi2-style package may choose a particular paramstyle, and you have to use the one the library uses. This is probably the most annoying part of this standard. (a few allow you to set different paramstyles, but this isn't standard behaviour) (Sure, the library reports which it uses, but it does mean that supporting all databases at once will need some wrapping for paramstyle)

There are also styles that reference a dictionary with keys, which can be handy as it does not rely on argument order.
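For example, the standard-library sqlite3 module accepts both the 'qmark' and the 'named' style, so the same insert can be written either way. A self-contained sketch (table name is just illustrative):

import sqlite3

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute("CREATE TABLE tabby (title TEXT, text TEXT)")

ti, te = "'Tonsils!'", '"Fish!"'

# 'qmark' style: positional markers, hand in a tuple
curs.execute("INSERT INTO tabby (title, text) VALUES (?, ?)", (ti, te))

# 'named' style: named markers, hand in a dict
curs.execute("INSERT INTO tabby (title, text) VALUES (:title, :text)",
             {'title': ti, 'text': te})

conn.commit()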

Notes on specific libraries

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

MySQL:

  • MySQLdb
    • uses 'format' paramstyle, though 'pyformat' if you hand in a dict (since which version? (verify) Supported in 1.2.2, and probably earlier, but I've seen bug reports about earlier versions)
    • can create cursor as DictCursor to have fetch functions return a dict instead of an array, e.g.,
      cur = conn.cursor(MySQLdb.cursors.DictCursor)
    • You may care about MySQL_notes#MySQLdb_and_encodings

Postgres:

  • psycopg
    • can create cursor as DictCursor to have fetch functions return a dict instead of an array, e.g.,
      cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
      (note that it will return a row object that prints like a list but also acts like a dict. If you want a dict only, use RealDictCursor)

SQLite:

  • pysqlite3
    • uses 'qmark' paramstyle


MSSQL:

  • mssqldb
    • uses 'qmark' paramstyle


Oracle:

  • DCOracle2
    • uses 'numeric' paramstyle


Others:

  • mxODBC
    • uses 'qmark' paramstyle
  • DB2
    • uses 'format' paramstyle
  • sapdbapi
    • uses 'named' paramstyle

Result iteration - or not

Note that cursor objects are iterable (and rows are tuples), meaning you can do things like:

cursor.execute("select id,text from tag")
for i,text in cursor:
    pass   # do stuff with i and text

...which is noticeably shorter than the also fairly common and effectively equivalent:

cursor.execute("select id,text from tag")
while True:
    row = cursor.fetchone()
    if row is None:
        break
    i,text = row
    # do stuff with i and text

It seems not quite everything supports iteration over cursor objects, though.


You should also know that fetchone() (and also cursor iteration when backed by fetchone) is relatively inefficient - it means many more roundtrips than necessary. Dealing with blocks of data using fetchmany is often a little more efficient.

As a rule of thumb: If you know the data set is always small, use fetchall (which fetches all data from the database's result set into memory). If it may be large, get chunks of data with fetchmany().
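A minimal sketch of the fetchmany() pattern (the chunk size of 500 is arbitrary):

cursor.execute("select id,text from tag")
while True:
    rows = cursor.fetchmany(500)   # up to 500 rows per call
    if not rows:
        break
    for i,text in rows:
        pass   # do stuff with i and text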


Note that certain libraries may not behave the way you expect. For example, while the idea is that fetchone/fetchmany/fetchall let you control when data is transferred and how much at a time, the last time I used MySQLdb it always fetched all result data to the client anyway - which can be very disruptive to memory use (and was, which is how I found out it does this; there was a way to work around it, but it was module-specific and a little fragile).

Postgres

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

In most cases you're looking for psycopg2. For a more complete list:

  • psycopg - see below
  • pg8000 [3]
    • pure python (doesn't even use libpq)
  • popy [5]
    • PoPy was merged into PyGreSQL (in which version?) (verify)
  • PyGreSQL [6] - early attempt, not very good
  • pgsql


psycopg2

Intro

DBAPI2 adapter.

psycopg2 is current. The module called psycopg is an older version.


Some practical notes:

  • paramstyle: 'pyformat' (dict) or 'format' (tuple)
  • aside from setting the isolation level, conn.set_isolation_level() also lets you enable autocommit
  • type handling:
    • None, True, and False are converted to their SQL counterparts
    • int, long, float, Decimal: to the database, a simple numerical representation; from the database, NUMERIC/DECIMAL columns come back as Decimal (unless you register a different conversion) (verify)
    • str, unicode: SQL strings, according to the client encoding - see notes below
    • Python tuples - usable e.g. with SQL's IN operator - but note an empty tuple is invalid SQL (see the sketch after this list)
    • Python lists: PostgreSQL ARRAYs. In many cases you want to wrap a list via tuple()
    • buffer: PostgreSQL binary string, for bytea [7]
    • datetime, date, time, timedelta: PostgreSQL’s timestamp, date, time, interval
  • no automatic reconnect (verify)
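A minimal sketch of a few of these conversions (connection details and the events table are made up):

import datetime
import psycopg2

conn = psycopg2.connect(host='127.0.0.1', database='example', user='example')  # made-up credentials
cur = conn.cursor()

# 'pyformat' with a dict; datetime becomes a timestamp
cur.execute("INSERT INTO events (name, happened) VALUES (%(name)s, %(when)s)",
            {'name': 'test', 'when': datetime.datetime.now()})

# a tuple becomes a parenthesized value list, usable with IN (but must not be empty)
ids = [1, 2, 6, 8, 21]
cur.execute("SELECT id, name FROM events WHERE id IN %s", (tuple(ids),))

conn.commit()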


Some technical notes

  • threadsafe (DBAPI level 2, meaning threads may share the module and connections, but not cursor objects)
  • Reuses connections for cursors, so you can use cursor() many times (though transaction commits and rollbacks are connection-level things)
  • you can get server-side cursors (by specifying a name when creating a cursor())
  • psycopg2 has some connection pooling code
  • Allows run-time changes of the mapping between database and python types -- which can be neat when you need it (avoids some manual work), and otherwise just looks a bit bothersome
  • connection objects can be used as context managers -- but note this closes the transaction, not the connection (see the sketch below).
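A sketch of those last two points, assuming an existing conn and recent psycopg2 (2.5 or later for the context-manager forms):

# server-side (named) cursor: rows stay on the server until fetched
cur = conn.cursor(name='big_read')   # the name is what makes it server-side
cur.itersize = 2000                  # rows fetched per network round trip
cur.execute("SELECT id, text FROM tag")
for row in cur:
    pass   # do stuff
cur.close()

# as context managers: the connection commits on success and rolls back on exception,
# but stays open afterwards; the cursor is closed when its block ends
with conn:
    with conn.cursor() as cur:
        cur.execute("UPDATE tag SET text = %s WHERE id = %s", ('new', 1))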



See also:


On text coding, and binary blobs
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Python/psycopg level

py2:

  • sending bytes is the default
  • sending unicode gets encoded according to the connection encoding (verify)
  • receiving data is left as bytestrings by default
    • if you want it to give you unicode, register a typecaster (see psycopg2's 'Unicode handling' docs)


py3:

  • sending unicode is encoded with the connection .encoding
  • sending bytes ends up as bytea (verify)
    • bytea data comes back as a memoryview; if you want that as bytes, one way is bytes(mv)
  • received strings are decoded according to the connection .encoding
    • if you want those as bytes instead, there is a typecaster for that [8]


You should probably set that .encoding via .set_client_encoding() (verify) (note: to the backend, 'UNICODE' is the same as 'UTF8')
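For example (the typecaster registration is only relevant on py2, where text otherwise stays bytestrings):

conn.set_client_encoding('UTF8')

# py2 only: have text columns come back as unicode instead of bytestrings
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE, conn)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY, conn)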





Postgres level

Note that there are two distinct things at play here

  • getting the strings to the server over a byte connection, according to client_encoding (which is a session variable (verify))
    • can be altered per connection
  • storing the strings in column data, according to the database-wide encoding
    • cannot be altered (you can cheat, but it is wiser to do it via dump-create-insert)


These days, the database encoding often defaults to UTF8 (in postgres, UNICODE is an alias for UTF8).

Also, client_encoding defaults to the database encoding [9] -- assuming nothing explicitly changes it

As such, usually all unicode gets transmitted and stored transparently (as UTF-8).

For reference:

  • the server defaults to the database encoding (but can be globally forced otherwise) (verify)
  • libpq can pick up the PGCLIENTENCODING environment variable (verify)
  • you can set it for a session using SET client_encoding = 'UTF8'
    • some code frameworks may do this


Nicer column fetching
Server-side cursors
On transactions

Postgresql_notes#psycopg_and_transactions


On pooling
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

psycopg2 has its own pooling logic. It is in-process, so this only makes sense for long-lived processes.


Constructing

Something like:

import psycopg2.pool

pool = psycopg2.pool.SimpleConnectionPool( minconn=1, maxconn=10,
                                           user="postgres", host="127.0.0.1", database="postgres" )

If your application is threaded, use ThreadedConnectionPool, as it ensures a connection will not be shared between threads.


Using

Instead of conn = psycopg2.connect(), you now call:

conn = pool.getconn()

Once you're done with it, call

pool.putconn( conn ) 

(If you don't, you'll quickly get "error in executing with exception: connection pool exhausted")
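In practice you probably want the putconn() in a finally clause, so that an exception mid-query does not leak the connection. A sketch:

conn = pool.getconn()
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    print cur.fetchone()
    cur.close()
finally:
    pool.putconn(conn)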



After you're done

You should curs.close(), but not conn.close() (verify)

This makes sense, but is a little counterintuitive as without pooling, closing early is good practice to avoid idle connections taking up slots.

As I tend to have a fallback to not pooling, my code tends to end up with

if pooling:
    pool.putconn(conn)
else:
    conn.close()


There is a pool.closeall(), but in most situations the gc will get to it, so you don't really need to call it explicitly.


See also:

Errors

psycopg2.ProgrammingError: syntax error at or near "ARRAY"

Usually means you're passing in a list (becomes an ARRAY), where you actually wanted to pass in a tuple.

Since that was probably a list for your own for loop, you probably want to convert it late, e.g.

curs.execute(  'update files  set instrument=1  where id in %s', (tuple(assign1),)  )

connection pooling

psycopg2's pool.py is simple in-process code that hands around connections, which can be handy for multi-threading, and for a little less management.


If you want more module-neutral pooling code, you can e.g. use SQLAlchemy's pooling.


(If you want client/language-agnostic connection pooling, you may be interested in pgpool. Since it's an intermediate process it doesn't lower all of the sources of latency, but it's an easy-drop-in that is a good start.)

MySQL

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

MySQL python libraries(verify)

  • MySQLdb, a.k.a. mysql-python [10]
    • seems the most central module for the mysql/python combination
    • has had some annoying behaviour bugs in the past
  • PyMySQL[11]
    • Pure-python; speaks mysql binary protocol itself
    • seems like a newish library(verify)
  • jaraco.mysql
    • A pure-python implementation of _mysql, apparently to the end of using MySQLdb without the need for a C compiler


MySQLdb

Cursors

MySQL doesn't support cursors much, and the MySQLdb package by default emulates cursors by copying all result data into memory, then feeding it through the DB-API fetch calls. On big datasets that can sometimes be a resource-fatal non-solution, in which case you probably want to do:

import MySQLdb.cursors
conn=MySQLdb.connect(host='localhost',db='whatever',
                     cursorclass=MySQLdb.cursors.SSCursor)

This brings in some new quirks that mean your code needs to be more careful: you must (read all results (verify) and) close the cursor before doing anything else with the connection (hint: try/finally is handy, see the sketch below), and if you leave such a cursor unused it will self-destruct (on the MySQL side? (verify)).
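A sketch of being careful with such a cursor (assuming the conn from above):

cur = conn.cursor()   # an SSCursor, because of the cursorclass given to connect()
try:
    cur.execute("SELECT id, text FROM tag")
    for i, text in cur:
        pass   # do stuff
finally:
    cur.close()       # close before using the connection for anything else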


If you get AttributeError: 'NoneType' object has no attribute 'literal', that probably means you are trying to execute on a closed connection.

text encodings

Personal note: Through various bugs I've decided that I care much more about exact round tripping than about MySQL correctly interpreting the data. I still try for both, but only when that's useful and not just a headache.


You may have to tell the client library which encoding you are using, separately from telling MySQL itself. There is a technical reason behind this, but it is still annoying, particularly since it is both pretty necessary and not something people often mention.

In code and specifically for MySQLdb, this boils down to a few solutions, one of which is:

conn=MySQLdb.connect(host='localhost', db='db',
                     use_unicode=False, init_command="SET NAMES 'utf8'",charset='utf8')

The charset keyword needs to be on the connect call (verify). The SET NAMES can also be done via an execute().

use_unicode=True means 'convert incoming string(-like) columns to python-internal strings', which e.g. converts UTF8 data to python unicode strings. False means to leave data as-is. While this means less assistance, it also means more consistent behaviour (e.g. in the face of utf8_bin). Outgoing data seems to always be converted(verify).


pre-1.2.1 problems

Using the charset keyword argument to connect() seems to do things the clean way: set the charset locally and tell the database SET NAMES (verify).


However, before 1.2.1_p2 (at least 1.2.0 and 1.2.1gamma3, probably others), charset was not a keyword to connect, and at the initial time of writing this, these were all current versions of MySQLdb so a workaround was required to avoid random breakage on different hosts.


Doing SET NAMES yourself means you inform the database but not the client, which will still report (in my case) 'latin1'. When MySQLdb converts (e.g. for use_unicode), it uses that same value it reports (even though the package author informs me this is a purely informational variable).

Experimentation seems to suggest that half a workaround is to set the charset variable in addition to executing a SET NAMES. That is, telling client and server everything is utf8 should allow you to pass through utf8 untouched.

conn=MySQLdb.connect(host='localhost', db='db', 
                     use_unicode=False, init_command="SET NAMES 'utf8'")
conn.charset='utf8'

The problem with this is that it only seems to store properly when using use_unicode=True, while fetching fails - or rather, seems to depend strangely on environment variables. It also seems to mean that the module acts differently from the interactive interpreter than from mod_python for me (well, did yesterday, doesn't today, and yes, I did explicitly reload() these).

Which suggests to me that pre-1.2.1_p2 was simply badly broken in terms of unicode, and I cannot find documented elaboration on what the code does and doesn't do, so it's a gamble. Upgrade to 1.2.1_p2 or later.


For a project, I wrote a wrapper that tells MySQLdb to do nothing and does all conversion itself. Even though it's a hack, it seemed a more robust way to get a transparent unicode interface under assumptions that were true for that project (e.g. that all text is UTF8 in the database).

dev versions
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The easiest way to install a version more recent than your package manager version is probably:

easy_install mysql-python

You may have to install a specific version of libmysqlclient(-dev)


ImportError: libmysqlclient_r.so.16: cannot open shared object file: No such file or directory

A few cases, including:

  • You upgraded MySQL without upgrading the MySQLdb module (package called python-mysql or mysql-python or so). You can install the older mysql client, install a newer version of MySQLdb, or compile MySQLdb against the version you have.
  • If you have an egg installed for MySQLdb, this is likely to cause problems (statically linked, not part of package management). You generally want to use your distro's package management, so probably want to delete these eggs.

Text

Lists

Multiple tuple inserts

Multiple inserts, which will often be faster than separate ones, can apparently (verify) be done like this (example from MySQLdb docs):

c.executemany( """INSERT INTO breakfast (name, spam, eggs, sausage, price)
                                 VALUES (%s, %s, %s, %s, %s)""", 
                  [ ("Spam and Sausage Lover's Plate", 5, 1, 8, 7.95 ),
                    ("Not So Much Spam Plate", 3, 2, 0, 3.95 ),
                    ("Don't Want ANY SPAM! Plate", 0, 4, 3, 5.95 )   ] )

This will also work for the hash-based markers, etc.


Note that this will repeat the execution, which is similar to but not strictly equivalent to:

INSERT INTO breakfast (name, spam, eggs, sausage, price)
VALUES (%s, %s, %s, %s, %s),(%s, %s, %s, %s, %s),(%s, %s, %s, %s, %s)

However, building that by hand is not as clean, and it usually amounts to the same thing anyway (see the sketch below if you do want it).
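If you do want a single multi-row INSERT, you can generate the placeholders and flatten the parameters yourself. A sketch, assuming a 'format'-style library:

rows = [ ("Spam and Sausage Lover's Plate", 5, 1, 8, 7.95),
         ("Not So Much Spam Plate",         3, 2, 0, 3.95) ]

placeholders = ','.join( ['(%s, %s, %s, %s, %s)'] * len(rows) )
flat_params  = [value  for row in rows  for value in row]

c.execute("INSERT INTO breakfast (name, spam, eggs, sausage, price) VALUES " + placeholders,
          flat_params)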


Data encoding/translation/safety

to rewrite


lists and chunking

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Every now and then you want to supply a list of elements into a single query, e.g. make select * from tabby where id in (%s) turn into select * from tabby where id in (1,2,6,8,21), particularly when the list is larger, since the alternative is doing as many selects as there are (in this case) ids.

Note that the only reason you should want this is if you have this data only on the client. If a temporary table works, use that instead.


DB-API does not allow lists, at least not while using proper escaping. Integers are of course safe, but I consider it bad style to abandon a good thing.

You can work around this by using the escape function that the specific DB-API library uses. Yes, this locks you to the library/libraries you actually use, which is also bad style, but it is still useful.

There is also an extra problem to be fixed: the list can be very long - larger than a single packet to the server will allow (limits may be set by the server, the client, and ODBC-like abstraction layers) - in which case you should probably split it into separate queries that each handle a chunk of arguments.


In other words, you want a function that takes a single list, and returns a list of escaped strings (either a list of string elements or a single string), that you can substitute in safely via %s.

TODO: clean up and paste in code
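A sketch of one approach: instead of escaping values yourself, generate the right number of placeholders per chunk and let the library do the escaping (function name and chunk size are just illustrative; 'format' paramstyle assumed):

def execute_in_chunks(curs, sql_template, values, chunksize=500):
    ''' sql_template should contain a single %s where the value list goes,
        e.g.  "select id,text from tag where id in (%s)"
        For selects, collect results (e.g. curs.fetchall()) between calls. '''
    for start in range(0, len(values), chunksize):
        chunk        = values[start:start+chunksize]
        placeholders = ','.join( ['%s'] * len(chunk) )
        curs.execute( sql_template % placeholders, chunk )

execute_in_chunks(curs, "select id,text from tag where id in (%s)", list_of_ids)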

Backend alternatives

SQL, transactional

  • PostgreSQL
  • MySQL
  • SQLite (file-based) (being local IO may make this faster than networked databases for small to medium scale projects, and it scales decently)

When considering SQLite, note that if you want the ability for concurrent access but don't need transactions nor necessarily a relational approach, Berkeley DB may offer a more flexible and sometimes faster answer.


Zope's Zcatalog is lazier, which can be faster for certain applications.

File databases

shelve

You can use shelve to get automatic pickling of python objects to disk. shelve objects act like disk-backed dicts (but see the writeback option).
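Minimal usage looks like the following (the path is just an example):

import shelve

d = shelve.open('/tmp/example.shelve')   # may create more than one file, depending on backend
d['config'] = {'retries': 3, 'hosts': ['a', 'b']}   # the value gets pickled
d.close()

d = shelve.open('/tmp/example.shelve')
print d['config']['hosts']
d.close()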

The shelve module chooses a backend automatically (dbhash, gdbm, dbm, dumbdbm), and inherits its properties, including possible extra files and file extensions.

You can also choose to get a shelve backed by something specific. In the case of BDB, there are two ways of doing so, shelve.BsdDbShelf and bsddb.dbshelve. The latter allows you to use a DBEnv, for concurrent, e.g. thread-safe access.


On writeback

The basic shelve is just a drop-in for a dict that does nothing more than unserializing from disk, and serializes to disk whenever something is assigned to a key.

That last part also means that mutable objects stored in a dict will not be automatically written back when you change them. This is perhaps most visible when you try to change mutable objects like:

d['ar'].append(5)
#or
d['di']['data']='val'

You are not actually setting the value for a key in the dict, so the write-back code that sits there isn't triggered.

You often want to be sure you will use the shelve's setter. The most general way to do this is to get the (full) value into a python variable, change it, then set it into the shelve again under the same key.

ar = d['ar']
ar.append(5)
d['ar'] = ar

The above is true when writeback=False, the default. When writeback=True, you only access an in-memory cache of all values you get and/or set, meaning there are no problems changing mutable variables because nothing is written until you sync() to disk (note that close() is also an implicit sync()). However, keep in mind that this cache does not have a size limitation, meaning it can take a lot of RAM, and a sync (e.g. on close) may take a while flushing a large amount of changes.
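The same append, with writeback=True (note the explicit sync; everything you touch stays cached in RAM until then):

d = shelve.open('/tmp/example.shelve', writeback=True)
d['ar'] = []
d['ar'].append(5)   # works, because we're mutating the in-memory cache
d.sync()            # writes cached entries back to disk
d.close()           # close() also implies a sync()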


Oldschool

See also File database notes, and Berkeley DB.

In python:

  • dbm: interface to the Unix dbm / ndbm library.
  • gdbm: interface to the gdbm library.
  • anydbm is a generic interface that eases opening dbm-style databases; it uses the whichdb module to guess what type the file is and uses the appropriate module for it. It can also fall back to dumbdbm if a library is not available; dumbdbm is a portable but slower implementation
  • bsddb for Berkeley DB
  • dbhash: an interface to bsddb that mirrors the functions present on the dbm-style modules.


These libraries often make an open database usable as a dictionary object, enabling you to transparently persist data, although not all interfaces implement all dictionary functions. For example, keys() and values() may not work, and some provide iterators based on the hash, unordered but fast.


On concurrent access

  • assume nothing is threadsafe unless mentioned otherwise. This includes shelve.
  • in general, things do not allow concurrent multiprocess read/write, though you can often do concurrent reading. For concurrent writing, you could use file locking.
  • bsddb:
    • The legacy interface (bsddb.hashopen, btopen, rnopen) functions became thread-safe in Python 2.5
    • you get concurrent multiprocess and threadsafe writing when you use the bsddb.db API (since Python 2.3). Requires BDB 3 or later. Documentation and newer versions (should you need them) are here

shove

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(You might also be interested in stockpyle)


shove is a pluggable system of stores and recent-item caches, with optional compression.

In code, shove objects act like dictionaries.

It depends on libraries for certain functionality (but only once you ask for them, so they're not hard dependencies): sqlalchemy for database stuff (...that can't be handled by the standard lib, like bsddb and anydbm), memcache for memcached, boto for S3 storage, durus for Durus, pysvn for subversion, and ZODB for the Zope Object DB.


Store and cache types

Stores:

  • simple: simple dict-like in-memory store (no size limit(verify)). The default.
  • memory: threadsafe variant of simple (adds locking)
  • file: file-backed variant of simple
  • dbm: anydbm-backed store
  • bsddb: bsddb-backed store
  • sqlite, mssql, postgres, firebird, mysql, oracle (uses SQLalchemy)
    • takes keywords: tablename (defaults to 'store')
  • svn: subversion-backed store
    • takes more keywords (alternative to embedding login in URL?(verify))
  • s3: Backed by Amazon S3. Requires boto
  • zodb: Backed by Zope Object DataBase see e.g. [12]
  • ftp: Backed by FTP server
  • durus: Backed by Durus (python persistence)


Caches:

  • simple: in-memory. The default.
    • takes keywords: max_entries (default: 300), maxcull (default: 10), timeout (default: 300). Culling happens on item set (relatively frequently (verify)); which items get culled is chosen at random.
  • memory: threadsafe variant of simple
  • file: file-backed variant of simple (keys map to filenames in a cache directory)
  • simplelru: in-memory LRU cache
    • takes keywords: max_entries
  • memlru: thread-safe variant of simplelru
  • filelru: file-backed variant of simplelru
  • memcache: Uses memcache module for memcached access. Address spec is semicolon-separated server list.
  • bsddb:
  • mssql, sqlite, postgres, firebird, mysql, oracle: Mostly just uses SQLalchemy
    • takes keywords: tablename (defaults to 'cache')




Notes on cache and store speeds
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

This is not based on proper benchmarking. Don't trust it, take all figures roughly. Assume things may well be different at different scales.


On cache speeds

When you want a shared cache, you probably want memcached. If not shared, memlru is functionally much the same.


The rest of the memory cache types are per-process caches. For that set, my suggestions are:

  • use lru types (simplelru/memlru) over the non-lru caches even if access patterns are purely random;
  • use memlru if you need thread safety, use simplelru if you don't (a little faster)
  • set max_entries high enough for your use (the default is conservative)
    • When using a non-lru type (i.e. simple-style cleanup), this has a positive effect only until the cache is full (before it is full there's no culling at all, but after that the culling work takes that much longer. If you insist on this type of cache, you can avoid some stuttering by setting maxcull proportionally higher as well)
  • For large and highly volatile caches, you may wish to avoid slow-culling caches (simple style), and avoid the higher overhead of syncing caches when you don't need it.
  • fill-once-and-read-many-times caches are of course very different


Notes on store speeds:

The speed of databases depends a number of things that you probably already know: how local/remote the connection is, whether the database can serve you in/from memory, whether you are assisted by the OS cache, and such. When well-tuned, they can be fast.


Of the file-based storage:


The read times of file as well as the file databases seem to be pretty much the same - for smallish data sets.


file is fairly fast and scales decently when the underlying filesystem is not a limitation (think files-per-dir limits). Given that you usually can't control the underlying filesystem, a file-based database is often a more portable/predictable solution.


File-based databases seem to have a little more write-time overhead than a good filesystem, but overhead which is more predictable (less jittery). Access times are not affected much by record amount / data size. Particularly read-much-more-than-write stores are easy this way.

bsddb and dbm seem to read slightly faster and write somewhat slower than file. bsddb seems to have more stable (less jittery - but also somewhat higher) access times than dbm (I get the impression that bsddb isn't used as efficiently as it could be), and both figures are more stable (and higher) than file.

sqlite seems somewhat comparable to file (can't re-test it right now - sqlalchemy is broken on my host).


Notes on concurrent access
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Others options

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


  • SQLite via PySQLite,
    • possibly using an ORM like SQLObject or SQLAlchemy
    • or Axiom (which is currently basically a simple interface to SQLite)


  • shove - see notes above (BSD license)


  • Buzhug (pure-python, pythonic access, BSD license)
  • kirbybase (pure-python, pythonic access, Python License (BSD style))
  • Gadfly (pure-python, SQL-based access, license is 'keep copyright' notice)



  • cdb (disk hash map)



ORM notes

SQLAlchemy

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

SQLObject

Idea

SQLObject is useful for data persistence that is implicitly ACID (and therefore somewhat less bother than direct pickling).

You define a schema on the python side in the form of a class.

This class maps to a database table, rows map to objects, and columns map to attributes.


Instantiating an object using the schema class will create a row for it and return the object that corresponds to it.


Setting a member on that object will update the corresponding row, retrieving it will fetch it from the database (...so for speed, copy out the value instead of using the object member whenever doing that is safe).

Note that database values may not always be what you last set on an object -- rules of defaults-on-errors, procedural value filters, concurrency and sometimes duck typing apply.


To remove a row, call destroySelf() on the python object that represents it.

Doing a wider delete operation with a WHERE is also possible, see [16].


Declaring

For that to work, SQLObject needs a schema - it needs to know what the database looks like.

While SQLObject can be wrapped around any table that meets a few requirements, this sometimes is a lot of tweaking, so if you're starting from scratch anyway, consider working from the Python side: describe the schema in a class, then have SQLObject create the according table(s) in the database.

  • A class-that-goes-to-a-table extends the basic SQLObject class
  • it implies an id (int, primary key?) column that is used to keep track of rows, which you can read out and select on if you want to
  • see the documentation for the available column types
    • they include IntCol, StringCol, UnicodeCol (transparent UTF8 coding), DateCol, TimeCol, DateTimeCol. If you want to store binary data or (complex) python data structures, BLOBCol and PickleCol are particularly worth looking at
  • columns take keyword arguments like:
    • length, primarily for strings
    • default to specify a default value. This may be a callable, which will get executed later. You could e.g. use time.time
    • unique=True to add a UNIQUE constraint
    • notNone=True to add a NOT NULL constraint (NULL is None on the python side)


Database connections - a connection is also a connector object, in that SQLObject remembers how the connection was made, creates connections as necessary later, and pools them unless disabled using conn._pool=False. It should work properly under threading (e.g. threadpooled web servers), though there may be extra details to specific backends.

It keeps connections open, so long-running programs may meet the problem that a backends may time out and remotely disconnect without signaling this, which will cause an error on the first query that tries to execute on the actually-already-broken connection. You may want to add a keepalive to avoid this. (other solutions?(verify))


Since you select from the schema class, you usually set the connector on it:

 mydbmodule.Fish.setConnection(conn)

Now you can use that schema class to do selects, etc.:

 print list( mydbmodule.Fish.select() )


The connection/connector object itself can be either based on the class for the specific backend you use, like:

conn = sqlobject.SQLiteConnection('/user/databasename.db')
conn = sqlobject.MySQLConnection(host='example.com', db='foo', user='foo')

...or you can use a connector factory through a uniform URL like:

conn = sqlobject.connectionForURI( 'scheme://[user[:password]@]host[:port]/database[?parameters]' )

Example URLs:

'mysql://user:password@host/databasename'
'postgres://host:5432/databasename'
'sqlite:/C|/full/path/to/databasename'
'sqlite:/:memory:'                      #Older versions see this as an actual filesystem path



If the database serves more than one client / process, you probably want to disable caching. To tell sqlobject to not use cached values, declare your schema with a nested class:

 class Fish(SQLObject):
     class sqlmeta:
         cacheValues = False
     #definition as usual

Note: before 0.7, this and more was not sorted into sqlmeta and was set directly on the class. This may still work, but is probably deprecated.


Disabling the cache is the best way to have concurrent access work on widely connected-to databases (you probably want to use transactions too, to avoid data races). Disabling the values cache will slow things down, though, since each individual sqlobject attribute get will trigger a select.

Each attribute set will cause an update. If that is not necessary (or when you prefer to do this per transaction), you can alleviate it by enabling lazyUpdate, which means you have to sync such sets manually.
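A sketch of that (lazyUpdate is a sqlmeta option, and syncUpdate() writes out the pending column changes (verify); assumes the imports and connection from the example below):

class Fish(SQLObject):
    class sqlmeta:
        cacheValues = False
        lazyUpdate  = True
    name = UnicodeCol(length=100)

f = Fish.get(1)
f.name = u'one'
f.name = u'two'     # nothing sent to the database yet
f.syncUpdate()      # now a single UPDATE is issued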


Example

import sqlobject
from sqlobject import *
import datetime,time

#Define schema
class Fish(SQLObject):
    datatype        =     StringCol( length=50,  default='pending' )
    name            =    UnicodeCol( length=100, alternateID=True  )
    entered         =   DateTimeCol(             default=datetime.datetime.now )
    datatype_index  = DatabaseIndex( 'datatype' )


#Make SQLite connection to file, and set on the schema.
conn = sqlobject.connectionForURI('sqlite:/tmp/sqlo.db')
Fish.setConnection(conn)  # (there are other ways)

#Create tables according to schema. 
#(but don't try to create tables that already exist)
Fish.createTable(ifNotExists=True) 

#Create three objects
Fish(name='foo',datatype='pending')
Fish(name='bar',datatype='done')
Fish(name='zot')

Using the data:

#byName was created because name is an alternateID field
print Fish.byName('zot') 
#you can update things directly
Fish.byName('zot').name='littlebunnyfoofoo'

Show all contents:

>>> for t in Fish.select():   #no condition means select all
...     print " id %s: %s"%(t.id, `t`)

Result:
 id 1: <Fish 1 datatype='pending' name=u'foo' entered='datetime.datetime...)'>
 id 2: <Fish 2 datatype='done' name=u'bar' entered='datetime.datetime...)'>
 id 3: <Fish 3 datatype='pending' name=u'littlebunnyfoofoo' entered='datetime.datetime...)'>

Basic and complex queries

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Selecting/fetching objects is similar to a SELECT *, giving you a list of objects:


The whole table:

 Fish.select()

You can easily use basic SQLish query features:

 Fish.select(Fish.q.datatype=='done', orderBy=Fish.q.entered, limit=10)

Using your own WHERE clause:

 Fish.select(' length(name)>3 ', limit=10)

By an alternateID (UNIQUE, INDEXed) column:

 Fish.byName('zot')

(The schema has created this function automatically. You do have the option of changing the name of the function it creates)


Using python syntax magic via SQLBuilder (a relatively simple expression instead of an explicit where clause):

 Fish.select( Fish.q.name=='zot' )

This is perhaps mostly interesting to use in join fields.


Notes:

  • If you want multiple clients, you'll probably define the schema with value caching off. However, that means that every get (e.g. .name) and set (e.g. .name=value) represents a query to the database. This can be very slow if, for example, you do this in a loop (enable debugging to see whether this is happening). Try to do all sets/stores in one go.
Joins and such
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Unsorted

BLOBCol and PickleCol
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Additionally, BLOBCol values are base64'd to be safely storable in text fields. This means both overhead, and that other clients can't read them without some wrapper code.

PickleCol inherits this, so this is extra overhead on top of the pickling.


Most backends have a maximum query size, and SQLObject uses the query string to set any and all values, meaning that you are restricted by whatever query size limits apply, often 1MB -- which effectively becomes less than ~0.8MB given the overhead from pickling and base64.

You can try to work around this with separate parametric queries, but such solutions are generally specific to backends, and you should be careful with locking.


In MySQL

When the table is created and a BLOB is called for (for a BLOBCol or PickleCol), MySQL chooses a particular variation of BLOB based on the length you give it (see MySQL notes).

If you do not specify a length it defaults to a TINYBLOB (up to 255 characters), which will easily lead to errors like "Pickle data truncated."

Specify a length that will satisfy all your needs.


In SQLite

While by default you can store 1GB per row/cell, sqlite's query size limit is 1MB. If you cross either of these limits, you get a "DataError: String or BLOB exceeded size limit." (See also SQLite notes)


This can be a problem because text/blob data is encoded into the query itself, so values beyond roughly 1MB will not fit (and consider the base64 overhead).

The 1MB query size limit is the default in SQLite. Some distro packages have it set higher, and you can compile your own version of it with a higher limit. This is one solution to this problem - but your code still won't work on sqlite libraries with lower limits.


If you want to try to work around this with parametric queries, know that sqlobject doesn't use sqlite blobs or the Binary class to back BLOBCol/PickleCol - it pickles, then base64s, into text values. SQLObject doesn't seem to support sqlite blobs at all (that is, using Binary to set (and get) buffer objects seems to fail).

The code you'd want would probably be something like:

import pickle, base64

pickledata = pickle.dumps(data, 2)
cur.execute("INSERT INTO test (test) VALUES (?)", (base64.encodestring(pickledata),))

(note: entirely unverified)

Note: since sqlite's storage is duck typed, it doesn't matter much that the column type is VARCHAR and that text values are used. What does matter is that currently SQLObject doesn't use or support sqlite's blob types; it'll trip over the buffer objects (verify).

This doesn't matter so much, since text values are subject to the same limit. You can work around the problem (verify) by leveraging pysqlite2 to set the large text field (imitating SQLObject's encoding) via a parametric query. Be careful about database locking, though.

See also


Yet to read:

Shelves

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Using a BDB shelve with a DBEnv can make for thread-safe object storage. See also Python notes/Databases#shelve.


others

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
  • ORMit

Berkeley DB

Buzhug notes