Python database notes
From Helpful
| Various things have their own pages, see Category:Python. Some of the collection pages among them: |
| These are primarily notes This is probably not going to be complete in any real sense, and exists to contain bits of useful information. |
Contents |
SQL
DB-API
If you need access to an existing relational database, it's easiest to use a module for that database that implements the Python database API (PEP249, DB-API 2). (See some useful notes [1] and some examples [2])
parameters, paramstyle
Aside from being standardized and fairly well supported by different packages, it will also save you from escaping headaches by using marker-based parametrization. This allows things to come straight from variables, literals, meaning the library does all the work for you.
You can write something along the lines of:
ti=" 'Tonsils!' " te=' "Fish!" ' curs.execute("INSERT INTO tabby (title,text) values (%s,%s)", (ti,te))
When you hand in data in this way (tuple or dict, depending on paramstyle, see below), common types are converted by the library. More complex types (such as timestamps) may not all be automatically escaped by all libraries (the PEP doesn't require it), so you may need to explicitly use (library-specific) wrapper objects for conversion.
In fact, each dbapi2-style package may choose a particular paramstyle, and you have to use the one the library uses. This is probably the most annoying part of this standard. (a few allow you to set different paramstyles, but this isn't standard behaviour) (Sure, the library reports which it uses, but it does mean that supporting all databases at once will need some wrapping for paramstyle)
There are also styles that reference a dictionary with keys, which can be handy as it does not rely on argument order.
The different paramstyles are:
- 'qmark' Question mark style, e.g. '...WHERE name=?' (hand in a tuple)
- 'numeric' Numeric, positional style, e.g. '...WHERE name=:1' (hand in a tuple)
- 'named' Named style, e.g. '...WHERE name=:name' (hand in a dict)
- 'format' ANSI C printf format codes, e.g. '...WHERE name=%s' (hand in a tuple)
- 'pyformat' Python extended format codes, e.g. '...WHERE name=%(name)s' (hand in a dict). (Apparently this includes 'format' style.(verify))
Notes on specific libraries
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
MySQL:
- MySQLdb
- uses 'format' paramstyle, though uses 'pyformat' if you hand in a dict ((verify) since which version. Supported in 1.2.2, and probably earlier but I've seen bug reports about earlier versions))
- See also MySQL_notes#MySQLdb_and_encodings
PostgreSQL:
- psycopg
- uses 'pyformat' paramstyle
- See also Psycopg_notes#Encodings
- PyGreSQL [4]
- PyPgSQL [5]
- pg8000 [6]
- pure python (doesn't even use libpq)
SQLite:
- pysqlite3
- uses 'qmark' paramstyle
MSSQL:
- mssqldb
- uses 'qmark' paramstyle
Oracle:
- DCOracle2
- uses 'numeric' paramstyle
Others:
- mxODBC
- uses 'qmark' paramstyle
- DB2
- uses 'format' paramstyle
- sapdbapi
- uses 'named' paramstyle
Result iteration - or not
Note that cursors objects are iterable (and rows are tuples) meaning you can do things like:
cursor.execute("select id,text from tag") for i,text in cursor: #do stuff
...which is noticeably shorter than the also fairly common and effectively equivalent:
cursor.execute("select id,text from tag") while True: row=cursor.fetchone() if row==None: break i,text=row #do stuff
It seems not quite everything supports iteration over cursors objects, though.
You should also know that fetchone() (and also cursor iteration when backed by fetchone) is relatively inefficient - it means many more roundtrips than necessary. Dealing with blocks of data using fetchmany is often a little more efficient.
As a rule of thumb: If you know the data set is always small, use fetchall (which fetches all data from the database's result set into memory). If it may be large, get chunks of data with fetchmany().
Note that certain libraries may not behave according to what you think. For example, while the idea is that fetchone/fetchmany/fetchall let you control when data is transferred and how much at a time, the last time I used MySQLdb it always fetched all result data to the client anyway - which can be very disruptive to memory use (and was, which is how I found out it does this. There was a way to work around it, but it was module-specific and a little fragile).
MySQL
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
MySQL python libraries(verify)
- MySQLdb, a.k.a. mysql-python [7]
- seems the most central module for the mysql/python combination
- has had some annoying behaviour bugs in the past
- jaraco.mysql
- A pure-python implementation of _mysql, apparently to the end of using MySQLdb without the need for a C compiler
MySQLdb
Cursors
MySQL doesn't support cursors much, and the MySQLdb package by default emulates cursors by copying all result data into memory, then feeding it through the DB-API fetch calls. On big datasets that can sometimes be a resource-fatal non-solution, in which case you probably want to do:
import MySQLdb.cursors conn=MySQLdb.connect(host='localhost',db='whatever', cursorclass=MySQLdb.cursors.SSCursor)
This brings in some new quirks, like that you must (read all results? and(verify)) close the cursor before doing anything else with the connection (hint: try/finally is handy), and that if you leave such a cursor unused it will self-destruct (on the MySQL side?(verify)) which can have restricting implications to how you write the code that uses this.
If you get AttributeError: 'NoneType' object has no attribute 'literal', that probably means you are trying to execute on a closed connection.
text encodings
Personal note: Through various bugs I've decided that I care much more about exact round tripping than about MySQL correctly interpreting the data. I still try for both, but only when that's useful and not only a headache.
You may have a client that you need to inform you are using an encoding separately from informing MySQL itself. There is a technical reason behind this, but is still annoying particularly since it is both pretty necessary and not something people often mention.
In code and specifically for MySQLdb, this boils down to a few solutions, one of which is:
conn=MySQLdb.connect(host='localhost', db='db', use_unicode=False, init_command="SET NAMES 'utf8'",charset='utf8')
You can also use the first cursor you create to execute() that USE NAMES string, but you still need the keyword charset on the connect call.
use_unicode=True means 'convert incoming string(-like) columns to python-internal strings', which e.g. converts UTF8 data to python unicode strings. False means to leave data as-is. While this means less assistance, it also means more consistent behaviour (e.g. in the face of utf8_bin). Outgoing data seems to always be converted(verify).
pre-1.2.1 problems
Using the charset keyword argument to connect() seems to do things the clean way: set the charset locally and tell the database SET NAMES (verify).
However, before 1.2.1_p2 (at least 1.2.0 and 1.2.1gamma3, probably others), charset was not a keyword to connect, and at the initial time of writing this, these were all current versions of MySQLdb so a workaround was required to avoid random breakage on different hosts.
If you do SET NAMES yourself means you inform the database but not the client, which will still report (in my case) 'latin1'. When MySQLdb converts (e.g. for use_unicode), it uses that same value it reports (even though the package author informs me this is a purely informational variable).
Experimentation seems to suggest that half a workaround is to set the charset variable in addition to executing a SET NAMES. That is, telling client and server everything is utf8 should allow you to pass through utf8 untouched.
conn=MySQLdb.connect(host='localhost', db='db', use_unicode=False, init_command="SET NAMES 'utf8'") conn.charset='utf8'
The problem with this is that it only seems to store properly when using use_unicode=True, while fetching fails - or rather, seems to depend strangely on environment variables. It also seems to mean that the module acts differently from the interactive interpreter than from mod_python for me (well, did yesterday, doesn't today, and yes, I did explicitly reload() these).
Which suggests to me that pre-1.2.1_p2 was simply badly broken in terms of unicode, and I cannot find documented elaboration on what the code does and doesn't do, so it's a gamble. Upgrade to 1.2.1_p2 or later.
For a project, I wrote a wrapper that tells MySQLdb to do nothing, and does all conversion itself, which even though it's a hack seemed more robust of a transparent unicode interface under assumptions true for this project (e.g. that all text is UTF8 in the database).
dev versions
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
The easiest way to install a version more recent than your package manager version is probably:
easy_install mysql-python
You may have to install a specific version of libmysqlclient(-dev)
Text
Lists
Multiple tuple inserts
Multiple inserts, which will often be faster than separate ones, can apparently (verify) be done like this (example from MySQLdb docs):
c.executemany( """INSERT INTO breakfast (name, spam, eggs, sausage, price) VALUES (%s, %s, %s, %s, %s)""", [ ("Spam and Sausage Lover's Plate", 5, 1, 8, 7.95 ), ("Not So Much Spam Plate", 3, 2, 0, 3.95 ), ("Don't Want ANY SPAM! Plate", 0, 4, 3, 5.95 ) ] )
This will also work for the hash-based markers, etc.
Note that this will repeat the execution, which is similar to but not strictly equivalent to:
INSERT INTO breakfast (name, spam, eggs, sausage, price) VALUES (%s, %s, %s, %s, %s),(%s, %s, %s, %s, %s),(%s, %s, %s, %s, %s)
However, this is not as clean, and usually means the same thing anyway.
Data encoding/translation/safety
to rewrite
lists and chunking
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Every now and then you want to supply a list of elements into a single query, e.g. make select * from tabby where id in (%s) turn into select * from tabby where id in (1,2,6,8,21), particularly when the list is larger, since the alternative is doing as many selects as there are (in this case) ids.
Note that the only reason you should want this is if you have this data only on the client. If a temporary table works, use that instead.
DB-API does not allow lists, at least not while using proper escaping. Integers are of course safe, but I consider it bad style to abandon a good thing.
You can work around this by using the escape function that the specific DB-API library uses. Yes, this locks you to the library/libraries you actualyl use, which is also bad style, but it is still useful.
There is also an extra problem to be fixed, which lies in the fact that the list can be very long - larger than a single packet to the server will allow (limits may be set by server, client, and ODBC-like abstraction layers), in which case you should probably split it into separate queries that each handle a chunk of arguments.
In other words, you want a function that takes a single list, and returns a list of escaped strings (either a list of string elements or a single string), that you can substitute in safely via %s.
TODO: clean up and paste in code
Backend alternatives
SQL, transactional
- PostgreSQL
- MySQL
- SQLite (file-based) (being local IO may make this faster than networked databases for small to medium scale projects, and it scales decently)
When considering SQLite, note that if you want the ability for concurrent access but don't need transactions nor necessarily a relational approach, Berkeley DB may offer a more flexible and sometimes faster answer.
Zope's Zcatalog is lazier, which can be faster for certain applications.
File databases
shelve
You can use shelve to get automatic pickling of python objects to disk. shelve objects act like disk-backed dicts (but see the writeback option).
The shelve module chooses a backend automatically (dbhash, gdbm, dbm, dumbdbm), and inherits its properties, including possible extra files and file extensions.
You can also choose to get a shelve backed by something specific. In the case of BDB, there are two ways of doing so, shelve.BsdDbShelf and bsddb.dbshelve. The latter allows you to use a DBEnv, for concurrent, e.g. thread-safe access.
On writeback
The basic shelve is just a drop-in for a dict that does nothing more than unserializing from disk, and serializes to disk whenever something is assigned to a key.
That last part also means that mutable objects stored in a dict will not be automatically written back when you change them. This is perhaps most visible when you try to change mutable objects like:
d['ar'].append(5) #or d['di']['data']='val'
You are not actually setting the value for a key in the dict, so the write-back code that sits there isn't triggered.
You often want to be sure you will use the shelve's setter. The most general way to do this is to get the (full) value into a python variable, change it, then set it into the shelve again under the same key.
ar = d['ar'] ar.append(5) d['ar'] = ar
The above is true when writeback=False, the default. When writeback=True, you only access an in-memory cache of all values you get and/or set, meaning there are no problems changing mutable variables because nothing is written until you sync() to disk (note that close() is also an implicit sync()). However, keep in mind that this cache does not have a size limitation, meaning it can take a lot of RAM, and a sync (e.g. on close) may take a while flushing a large amount of changes.
Oldschool
See also File database notes, and Berkeley DB.
In python:
- dbm: interface to the Unix dbm / ndbm library.
- gdbm: interface to the gdbm library.
- anydbm is a generic interface that eases opening dbm-style databases; it uses the whichdb module to guess guess what type the file is and uses the appropriate module for it. It can also fall back to dumbdbm if a library is not available; dumbdbm is a portable but slower implemention
- bsddb for Berkeley DB
- dbhash: an interface to bsddb that mirrors the functions present on the dbm-style modules.
These libraries often make an open database usable as a dictionary object, enabling you to transparently persist data, although not all interfaces implement all dictionary functions. For example, keys() and values() may not work, and some provide iterators based on the hash, unordered but fast.
On concurrent access
- assume nothing is threadsafe unless mentioned otherwise. This includes shelve.
- in general, things do not allow concurrent multiprocess read/write, though you can often do concurrent reading. For concurrent writing, you could use file locking.
- bsddb:
- The legacy interface (bsddb.hashopen, btopen, rnopen) functions became thread-safe in Python 2.5
- you get concurrent multiprocess and threadsafe writing when you use use the bsddb.db API (since Python 2.3). Requires BDB 3 or later. Documentation and newer versions (should you need them) are here
shove
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
(You might also be interested in stockpyle)
shove can shelve objects into various types of stores, can use various tyoes of cache for recent items, for speed. It can do both of these in memory, in memcache, on disk, in databases, and in various storage services. It can optionally transparently compress things into the store.
On the code side, a Shove object acts like a dictionary, and is backed by whatever you told it to use (also makes it easy to drop in real backend after prototyping with something simpler).
It depends on libraries for certain functionality (but only once you ask for them, so they're not hard dependencies): sqlalchemy for database stuff (...that can't be handled by the standard lib like bsddb and anydbm), memcache for memcached, boto for S3 storage, durus, pysvn for subversion, ZODB for Zope Object DB)}}
The basic two arguments to the Shove constructor are store and cache, both URI-style specifications to say which store/cache you like, both of which default to 'simple://'. The rest of the keywords are handed to the constructor of the chosen store and cache classes.
General keywords:
- compress: compress in store (zlib, level 9 for some reason)
- sync: the amount of writes to buffer before syncing them to the store (verify)
Stores and cache types
Stores:
- simple: simple dict-like in-memory store (no size limit(verify))
- memory: threadsafe variant of simple (adds locking)
- file: file-backed variant of simple
- dbm: anydbm-backed store
- bsddb: bsddb-bascked store
- sqlite, mssql, postgres, firebird, mysql, oracle (uses SQLalchemy)
- takes keywords: tablename (defaults to 'store')
- svn: subversion-backed store
- takes more keywords (alternative to embedding login in URL?(verify))
- s3: Backed by Amazon S3. Requires boto
- zodb: Backed by Zope Object DataBase see e.g. [9]
- ftp: Backed by FTP server
- durus: Backed by Durus (python persistance)
Caches:
- simple: in-memory
- takes keywords: max_entries (default:300), maxcull (default:10), timeout (default:300). Item-set chooses to cull (relatively frequently(verify). Culling is random).
- memory: threadsafe variant of simple
- file: file-backed variant of simple (keys map to filenames in a cache directory)
- simplelru: in-memory LRU cache
- takes keywords: max_entries
- memlru: thread-safe variant of simplelru
- filelru: file-backed variant of simplelru
- memcache: Uses memcache module for memcached access. Address spec is semicolon-separated server list.
- bsddb:
- mssql, sqlite, postgres, firebird, mysql, oracle: Mostly just uses SQLalchemy
- takes keywords: tablename (defaults to 'cache')
Notes on cache and store speeds
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
This is not based on proper benchmarking. Don't trust it yet. Take all figures roughly. Assume things may well be different at different scales.
On cache speeds
When you want a shared cache, you probably want memcached. If not shared, memlru is much the same, except that memcached is probably faster in itself, though takes a little networking overhead.
The rest of the memory cache types are per-process caches. For that set, my suggestions are:
- use lru types (simplelru/memlru) over the non-lru caches even if access patterns are purely random;
- use memlru if you need thread safety, use simplelru if you don't (a little faster)
- set max_entries nice and high
...when writing with any regularity - of course, fill-once-and-read-many-times caches is significantly different.
In more words:
Avoids simple-type caches, because the cleanup present has noticably more overhead than lru-type when handling more items than fit in the cache.
The synchronized variants (memory://, memlru://) obviously have some more overhead than the non-synchronized versions. The locking itself takes very little time (milliseconds at most) but makes a noticable difference, in the case of memlru:// because that's on the same order of magnitude as real work, and in the case of memory:// because the (fairly continuous) culling constantly does a whole bunch of accesses, all of which will be synchronized.
The default max_entries is fairly low for many practical uses - you'll get better cache performance if you set this higher. However, when using a non-lru type (simple-style cleanup), this has a positive effect only until the cache is full (before it is full there's no culling at all, but after it is the culling work takes that much more. If you insist on this type of cache, it probably helps to set maxcull proportionally higher as well)
Notes on store speeds:
The speed of databases depends a number of things, those that any database storing varies with: how local/remote the connection is, and whether the database can serve you in/from memory, whether you are assisted by the OS cache, and such. Well-tuned and for non-huge data sets, this can be pretty fast.
Of the file-based storage:
The read times of file as well as file databases seems to all be pretty much the same. The write times for most seem to be about on the same order of magnitude, particularly for under a few hundred thousand records.
file is fairly fast and can scales decently when the underlying filesystem is not a limitation - it might also be fairly bad and/or inherit limitations like a per-directory amount limit. Given that the underlying filesystem is usually you can't control, a file-based database is often a more portable/predictable solution.
File-based databases seem to have a little more write-time overhead than a good filesystem, but overhead which is more predictable (less jittery). Access times are not affected much by record amount / data size (probably slightly, related to IO). Particularly read-much-more-than-write stores should probably opt for one of these.
bsddb and dbm seem to read slightly faster and write somewhat slower than file. bsddb seems to have more stable (less jittery - but also somewhat higher) access times than dbm (I get the impression that bsddb isn't used as efficiently as it could be), and both be more stable (and higher) than file.
sqlite somewhat comparable to file (can't re-test it right now - sqlalchemy is being broken on my host right now)
Notes on concurrent access
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Others options
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
- SQLite via PySQLite,
- shove - see notes above (BSD license)
- Buzhug (pure-python, pythonic access, BSD license)
- kirbybase (pure-python, pythonic access, Python License (BSD style))
- Gadfly (pure-python, SQL-based access, license is 'keep copyright' notice)
- NetCDF:
- numpy / Scientific.IO.NetCDF,
- ...or pynetcdf, a standalone version
- pupynere, faster (but experimental?)
- cdb (disk hash map)
- PyDbLite (pure-python, memory-only)
| This article (or section) isn't finished, and may contain temporary best guesses. Feel free to fix and add, but consider that the original or last may be working on it on and off. |
ORM notes
SQLAlchemy
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
- http://www.sqlalchemy.org/
- seems more solid than SQLObject
- More than just an ORM
SQLObject
SQLObject is an object-relational mapper (ORM), which is particularly useful for data persistance that is implicitly ACID (and therefore somwhat less bother than direct pickling).
Approach
You define a schema on the python side in the form of a class. This class corresponds with a database table, rows to objects, columns to attributes.
In various setups this is the last you have to see of SQL and can think in objects from then on.
SQLObject supports at least PostgreSQL, MySQL, MSSQL, SQLite, Firebird, Sybase, Maxdb - though note that the behaviour will differ somewhat between backends, particularly related to typing, particularly in non-standard fields like BLOBs (which only work on some databases, including postgres, mysql and sqlite), and some complex types.
Basic usage
Adding: Instantiating an object using the schema class will create a row for it and return the object that corresponds to it.
getting/setting values:
Setting a member on that object will update the corresponding row, retrieving it will fetch it from the database (...so for speed, copy out the value instead of using the object member whenever doing that is safe).
Note that database values may not always be what you last set on an object -- rules of defaults-on-errors, procedural value filters, concurrency and sometimes duck typing apply.
removing:
To remove a row, call destroySelf() on the python object that represents it.
Doing a wider delete operation with a WHERE is also possible, see [13].
The schema
For that to work, SQLObject needs a schema - it needs to know what the database looks like to be able to transparently.
While SQLObject can be wrapped around any table that meets a few requirements, it may much tweaking to meet those requirements and to map around name conflicts and such. (although for some backends it can fetch all schema details live)
It is often easier to do everything from the Python side: describe the schema in a class, then have SQLObject create the according table(s) in the database.
Writing the schema is done by creating a class for each table, adding members that describe columns, indices and such.
The class must extend the basic SQLObject class, which implies an id column that is used to keep track of rows, that you can read out and select on if you want to. (int, primary key, with automatically generated values one way or another (verify). (this member could technically be called something else in the database - useful to wrap SQLobject schemas around existing database structures - but it is always id on the python side)
See the documentation for the available column types. They include IntCol, StringCol, UnicodeCol (transparent UTF8 coding), DateCol, TimeCol, DateTimeCol. If you want to store binary data or (complex) python data structures, BLOBCol and PickleCol are particularly worth looking at.
The various ...Cols can usually take (roughly) the following keyword arguments:
- length, primarily for strings
- default to specify a default value. This may be a callable, which will get executed later. You could e.g. use time.time
- unique=True to add a UNIQUE constraint
- notNone=True to add a NOT NULL constraint (NULL is None on the python side)
- You can use a general Col and specify a sqltype if you want something unusual
- alternateID=True implies UNIQUE and INDEX (...when you create from this schema, otherwise you muts this is true in the database) and creates a search function on your class; in the above example you would get Thing.byName()
You can also use foreign keys. (FKs just schemawise?(verify) If so, you would have to beware of database engines that ignore these, like MySQL's default MyISAM, as well as SQLite)
It seems that when you check for table existance by name, you should lowercase the name ((verify) docs), so you would e.g. do conn.tableExists('fish').
Database connections
To connect to a data store, you create a connection that is also a connector object - SQLObject remembers how the connection was made, creates connections as necessary later, and pools them (unless disabled using conn._pool=False). It should work properly under threading (e.g. threadpooled web servers), though there may be details to specific backends.
It keeps connections open, so long-running programs may meet the problem that a backends may time out and remotely disconnect without signaling this, which will cause an error on the first query that tries to execute on the actually-already-broken connection. You may want to add a keepalive to avoid this. (other solutions?(verify))
Since you select from the schema class and, you usually set the connector on it:
mydbmodule.Fish.setConnection(conn)
Now you can use that schema class to do selects, etc.:
print list( mydbmodule.Fish.select() )
The connection/connector object itself can be either based on the class for the specific backend you use, like:
conn = sqlobject.SQLiteConnection('/user/databasename.db') conn = sqlobject.MySQLConnection(host='example.com', db='foo', user='foo')
...or you can use a connector factory though a uniform URL like:
conn = sqlobject.connectionForURI( 'scheme://[user[:password]@]host[:port]/database[?parameters]' )
Example URLs:
'mysql://user:password@host/databasename' 'postgres://host:5432/databasename' 'sqlite:/C|/full/path/to/databasename' 'sqlite:/:memory:' #Older versions see this as an actual filesystem path
If the database serves more than one client / process, you probably want to disable caching. To tell sqlobject to not use cached values, declare your schema with a nested class:
class Fish(sqlobject) class sqlmeta: cacheValues = False #definition as usual
Note: before 0.7, this and more was not sorted into sqlmeta and was set directly in the class. This may still work, but is probably depracated.
Disabling the cache is the best way to have concurrent access work on widely connected-to databases. (You probably want to use transactions too, to avoid data race conditions) Disabling the values cache will slow things down, though, since each individual sqlobject attribute get will trigger a select
Each attribute set will cause an update, which, if not necessary (or when you prefer to do this per transaction) you can alleviate the latter by enabling lazyUpdate, which means you have to sync sych sets manually.
Example
import sqlobject from sqlobject import * import datetime,time #Define schema class Fish(SQLObject): datatype = StringCol( length=50, default='pending' ) name = UnicodeCol( length=100, alternateID=True ) entered = DateTimeCol( default=datetime.datetime.now ) datatype_index = DatabaseIndex( 'datatype' ) #Make SQLite connection to file, and set on the schema. conn = sqlobject.connectionForURI('sqlite:/tmp/sqlo.db') Fish.setConnection(conn) # (there are other ways) #Create tables according to schema. #(but don't try to create tables that already exist) Fish.createTable(ifNotExists=True) #Create three objects Fish(name='foo',datatype='pending') Fish(name='bar',datatype='done') Fish(name='zot')
Using the data:
#byName was created because name alternateID field print Fish.byName('zot') #you can update things directly Fish.byName('zot').name='littlebunnyfoofoo'
Show all contents:
>>>for t in Fish.select(): #no condition means select all. print " id %s: %s"%(t.id, `t`) Result: id 1: <Fish 1 datatype='pending' name=u'foo' entered='datetime.datetime...)'> id 2: <Fish 2 datatype='done' name=u'bar' entered='datetime.datetime...)'> id 3: <Fish 3 datatype='pending' name=u'littlebunnyfoofoo' entered='datetime.datetime...)'>
Basic and complex queries
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Selecting/fetching objects is similar to a SELECT *, giving you a list of objects:
The whole table:
Fish.select()
You can easily use basic SQLish query features:
Fish.select(coltype='done', orderBy=Thing.q.date_entered, limit=10)
Using your own WHERE clause:
Fish.select(' length(name)>3 ', limit=10)
By a alternateID (UNIQUE, INDEXed) column:
Fish.byName('zot')
(The schema has created this function automatically. You do have the option of changing the name of the function it creates)
Using python syntax magic in SQLBuilder, using a relatively simple expression instead of an explicit where):
Fish.byName( Fish.q.name=='zot' )
This is perhaps mostly interesting to use in join fields.
Notes:
- If you want multiple clients, you'll probably define the schema with value caching off. However, that means that every get (e.g. .name) and set (e.g. .name=value represents a query to the database. This can be very slow if, for example, you do this in a loop (enable debugging to see whether this is happening). Try to do all sets/stores in one go.
Joins and such
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Unsorted
BLOBCol and PickleCol
Additionally, BLOBCol are base64'd to be safely storable in text fields. This means both overhead and that other clients can't read them without some wrapper code.
PickleCol inherits this, so this is extra overhead on top of the pickling.
Most backends have a query size maximum, and SQLite SQLObject uses the query string to set any and all values, meaning that you are restricted by whatever query size limits apply, often 1MB -- which probably becomes less than ~0.8MB given the overhead from pickling and base64.
You can try to work around this with separate parametric queries, but such solutions are generally specific to backends, and should be careful with locking.
MySQL
When the table is created, and a BLOB is called for (for a BLOBCol or PickleCol), MySQL chooses a particular variation of BLOB based on the length you give it (see MySQL notes.
If you do not specify a length it defaults to a TINYBLOB (up to 255 characters), which will easily lead to errors like "Pickle data truncated."
Specify a length that will satisfy all your needs.
SQLite
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
While by default you can store 1GB per row / cell, sqlite's query size limit is 1MB. IF you cross either of these limits, you get a "DataError: String or BLOB exceeded size limit." (See also SQLite notes)
This can be a problem because sending text/blob data of that size (and less on the endpoints; consider the base64 overhead) are encoded into query itself.
The 1MB query size limit is the default in SQLite. Some distro packages have it set higher, and you can compile your own version of it with a higher limit. This is one solution to this problem - although your code still won't work on sqlite libraries with lower limits.
If you want to try to work around this with parametric queries, know that sqlobject doesn't use sqlite blobs or the Binary class to back BLOBCol/PickleCol - it pickes, then base64s, into text values. SQLObject doesn't seem to support sqlite blobs at all (that is, using Binary to set (and get) buffer objects seems to fail).
The code you'ld want would probably be something like:
pickledata = pickle.dumps(data,2) cur.execute("INSERT INTO test (test) VALUES (?)",(base64.encodestring(pickledata),))
(note: entirely unverified)
Note: since sqlite's storage is duck typed, it doesn't matter that the column type is VARCHAR, and it uses text values. It does make a difference, though, since currently SQLObject doesn't use or support sqlite's blob types; it'll trip over the buffer objects(verify).
This doesn' matter so much, since text values are due to the same limit. You can work around the problem(verify) by leveraging pysqlite2 to set the large text field (imitating SQLObject's encoding) via a parametric query. Be careful about database locking, though.
See also
Yet to read:
- http://wiki.sqlobject.org/connections.html
- http://groovie.org/articles/2005/11/01/how-to-use-database-agnostic-sql-in-sqlobject
Shelves
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Using a BDB shelve with a DBEnv can makes for thread-safe object storage. See also Python notes/Databases#shelve.
others
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
- ORMit
Buzhug notes
| This article/section is a stub — probably a pile of half-sorted notes and assertions some of which may well be wrong, and not verified as a whole. Feel free to add or refine. |
Buzhug is a pure-python (and therefore portable) file-based database.
Depending on your needs (typing, concurrency), you may wish to consider bsddb.
Example
See also the buzhug tutorial
Create/open
from buzhug import Base db = Base('/path/to/directory/') db.create( ('name',str), ('age',int), ('coord',str), mode='open')
Base():
- A single directory represents a single database, and will contain various files for it.
- And does not have to exist yet, and should probably be empty if it already exists.
db.create():
- type is one of:
- str
- int
- float
- unicode (should use format specification, see below)
- datetime.datetime (should use format specification, see below)
- datetime.date (should use format specification, see below)
- mode:
- 'open': if it exists, just open instead (but may not have the same fields)
- 'override': overwrite database present here.
To fetch the field/type information:
- db.field_names: lists the names in the order you specified in create()
- db.fields: maps from names to the types they were created as
You can also Base(path).open(), which will IOError if it doesn't exist.
You should eventually db.close() things you open/create.
Inserting
record_id = db.insert(name='foo1', age=30, coord='1.3 3.4') record_id = db.insert('foo33', 40, '4.11 1.87')
- positional or keyword arguments
- missing fields will be set to None
- referring to fields that don't exist causes exceptions
...as strings
Personally, I recommend never needing the following. Do all conversion explicitly so you know when things go wrong.
When you receive data via networking, they are probably bytestrings as type str. If you have previously told buzhug how to decode that str into the actual type, e.g.:
db.set_string_format(unicode,'utf-8') db.set_string_format(date,'%Y-%m-%d')
...you can then pass such strings in directly (into db.insert_as_strings()) and have buzhug do the conversion.
Fetching data
select
This is the style you should probably learn first, because it is the easiest way to get sorting and to get updating.
All values for given field(s)
result_set = db.select(['name'])
Filtering by value
result_set = db.select(['name'],age = 30)
Filtering by options(str,unicode)
result_set = db.select(['age'],name = ('foo1','foo2'))
Filtering by range (int, float, date, datetime)
result_set = db.select(['name'],age = [30,35])
With more complex predicates, regexps, etc.
#(examples later)
Sorting:
A result_set from select() can be sorted. For example:
results = result_set.sort_by("-age+name")
...meaning descending by age, and ascending by name as secondary sort
by identifier
Hash-style access:
record = db[rec_id]
Iterator style
Iterator-style, in which you can add filters:
for record in [ r for r in db if r.name == 'pierre' ]: print record.name,record.age
This is usually slower than filtering with select(), since this fetches all records(verify).
Updating
Records are normally selected for reading, which is faster. If you are selecting things you want to update, you must(verify) use select_for_update(), and then call:
record.update(field=value)
There is concurrency control in that if two users select the same record and try to update it, bughuz will notice they both wanted to update from the same version, and will throw a buzhug.ConflictError.
Deleting
db.delete(record)
...or, if you know the record ID:
del db[record_id]
You should:
db.cleanup()
from time to time.

