Varied databases


On database types

Block store

Block stores give access to a large series of bytes, read and written in moderately-sized blocks at a time.

It's the most 'do everything yourself' variant of storage.

There are almost no use cases where users or services want a block device directly. Yes, in theory some specific servers might be more efficient doing their own management of their very specific data structures on disk - but many of those choose to use large files the same way, presumably because it amounts to much the same, and the filesystem takes over some of the management work.

"Block store" may just mean a large disk exposed at a low level - SAN tends to mean storage exposed this way.

In cloud contexts, it tends to mean 'when you connect a disk, we figure out the backing details' (which may be SAN-style under the hood).

Object store

Object stores often mean storing arbitrary-sized blobs of data, under arbitrary names.

(A blob store, today, usually refers to an object store - though 'Binary Large OBject' also still relates to doing this in relational databases, where it was perhaps less of a fit.)

And yes, you could see this as a lightly specialized key-value store, in that

  • the operations are very basic: mostly store, fetch, and delete, and maybe list
  • there may not be structure to the names - no meaningful interpretation by the store itself (less so than e.g. filenames)
but this varies

Differences to kv:

  • object stores may allow large objects (gigabytes or more), kv might limit values to a few kilobytes
roughly so that you use kv for short term, OLTP-ish stuff, and object stores for bulk storage and archiving stuff
  • object stores seem more likely to have list operation in some way, where kv tries to avoid it
  • object stores usually let you separately fetch metadata
kv often implicitly suggests you make that part of your data
  • object stores may have access control per item (and per larger unit, like a bucket); kv often has none
  • object stores may avoid transactions (there's typically no need), and be more eventual-consistency, particularly in designs that want to scale well
as may kv.

A lot of these differences seem to come from object stores often being aimed at "dump a lot of files in here" (with fewer or no implied directory-style semantics of a filesystem).

For example, S3 is an object store much more than a kv store.
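How little interface that takes can be sketched in a few lines (an in-memory toy; real object stores like S3 add buckets, access control, durability, and much more - the names here are made up for illustration):

```python
class ObjectStore:
    """Toy object-store-like interface: store, fetch, delete, list,
    plus per-object metadata fetched separately. A sketch, not production code."""

    def __init__(self):
        self._objects = {}   # name -> (bytes, metadata dict)

    def put(self, name, data, metadata=None):
        self._objects[name] = (bytes(data), dict(metadata or {}))

    def get(self, name):
        return self._objects[name][0]

    def head(self, name):          # metadata without the (possibly huge) body
        return self._objects[name][1]

    def delete(self, name):
        self._objects.pop(name, None)

    def list(self, prefix=''):     # names are opaque; a prefix is the only structure
        return sorted(n for n in self._objects if n.startswith(prefix))


store = ObjectStore()
store.put('reports/2024.csv', b'a,b\n1,2\n', {'content-type': 'text/csv'})
store.put('reports/2023.csv', b'a,b\n3,4\n')
print(store.list('reports/'))          # ['reports/2023.csv', 'reports/2024.csv']
print(store.head('reports/2024.csv'))  # {'content-type': 'text/csv'}
```

Note that the 'directory' feel of `reports/` exists only because of the prefix match - the store itself sees one flat namespace.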


It's worth pointing out that a filesystem is a database.

Code (usually kernel code) manages all the data structures, it has strict semantics (e.g. POSIX's) that ensure things happen in an atomic, consistent way.

Yet it turns out a filesystem is a relatively poor database.

For a slew of reasons, perhaps summarized as: this bit of general-purpose convenience specializes in nothing.

Reasons include:

  • many things are not transactional - meaning you need a lot of error handling in every program
e.g. a treewalk can find itself erroring out on something that it was just now told existed.
(This is where NoSQL wouldn't help much either - they rarely even try to be transactional. Even in relational where you often can get this, you still have to think about it)

  • some restrictions exist in theory but are hard to enforce
say, filesystems tend to have limitations like '255 characters in a filename' and '32767 characters in a path'.
So what happens when you're a few bytes under the latter limit, and you move a directory into a subdirectory, so that the directory itself doesn't break the limit, but files in it now do?
If that move is implemented as just "this directory's parent is now this other directory", this is often not checked - which means it will work until some API calls break.
If we were to check this, we would need to do it for a lot of items.

  • Text encoding is usually... breakable.
On byte-based unices, where filenames are conventionally UTF-8, it's entirely possible to store invalid byte sequences.
On windows, where filenames are mostly UTF-16, it's entirely possible to get invalid sequences (e.g. unpaired surrogates).
I have one file on my windows drive that I haven't been able to delete in years. Explorer cannot do it. The CLI cannot do it. Probably something can, but I'll be damned if I can find even the reason it exists this way.
Languages that want to interface with both linux and windows have to do even more creative things.

As such, APIs that want to be portable between filesystems have the task of creatively wrapping different existing APIs, and must somehow deal with the fact that the underlying filesystem might well allow incorrect data.

  • Tree structure can be weird
for example, symlinks and links may mean that what you previously thought was a tree is actually an acyclic graph or even cyclic graph instead.
the last meaning that treewalking is technically infinite, except it will error out on path length at some point
and before you think that's a *nix thing, windows has even more types of links

  • there are some practical cases of unnecessarily bad algorithmic complexity
consider e.g. an ls that goes through items and stat()s each. It is possible that stat is implemented in a way where each call is linear in the number of directory entries (linked list or similar). That makes a simple directory listing quadratic in the number of entries.
and trust me, windows is no stranger to being slow as molasses when viewing more than 20K or so things in a single directory.
also part of why distributed stores don't like to have a 'list' operation, or if they do, make it part of local organization they put on you (consider e.g. S3 buckets)
  • The POSIX interface (and the windows one too) is defined very serially.
Unnecessarily so, but thirty years ago it mostly worked fine, and now we can't change the semantics.
Consider that a treewalk on a POSIX style filesystem is tens to hundreds of thousands of syscalls, that also have to happen in absolutely strict sequence because the semantics require it. Not even the much smaller latency of SSDs will help much here.
which is why distributed filesystems will point out their own, non-standard API - to sidestep these issues
  • metadata isn't centralized / indexed, so most queries are inefficient
most writes go to two places on disk. On platter that was always bad for performance
I have one directory structure for a bunch of structured data - 500K files, 19K directories.
When the metadata is not cached in RAM, it takes eight minutes just to walk through it and list their names.
That's just stats, directory-opens, and getdents calls, because that is the only way to fetch that structure.
Even when it is cached, it still takes three minutes
I have that same directory listing mirrored in a database.
It's milliseconds in total to do most things I care to do.
And a few hundred milliseconds for rare things.
Even a total dump only takes seconds.
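To make the comparison concrete: the 'mirror the tree in a database' idea can be sketched with Python's stdlib and SQLite (a tiny example tree here, not the 500K-file case - the schema is just an illustration):

```python
# Sketch: mirror a directory tree's metadata into SQLite, so queries that
# would otherwise need a full treewalk become a single indexed lookup.
import os
import sqlite3
import tempfile

# build a small example tree: two files at the top, one in a subdirectory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for name in ('a.txt', 'b.txt', os.path.join('sub', 'c.txt')):
    with open(os.path.join(root, name), 'w') as f:
        f.write('x' * 10)

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)')
for dirpath, _dirnames, filenames in os.walk(root):   # the slow part, done once
    for fn in filenames:
        p = os.path.join(dirpath, fn)
        db.execute('INSERT INTO files VALUES (?, ?)', (p, os.stat(p).st_size))
db.commit()

# 'how many files, how big in total' is now one query instead of many syscalls
count, total = db.execute('SELECT COUNT(*), SUM(size) FROM files').fetchone()
print(count, total)   # 3 30
```

The tradeoff, of course, is that the mirror goes stale the moment the filesystem changes - which is exactly the kind of consistency problem filesystems don't help you with.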

See also: Computer data storage - Partitioning and filesystems


Relational databases

The old standby.

Also great at structure, generally the best consistency managers we have, and allows things like relational integrity that can actually be enforced.

The consistency is not due to their relational nature; in fact that makes things a little harder. It's just that consistency was one of their design goals, because they used to be the only real choice for a serious database.

That same consistency is what fundamentally limits their ability to scale. (Even loosening the consistency management only makes them somewhat faster; it is their overall model that doesn't scale.)

NoSQL scales better in part because its guarantees are simpler, and less strong.

NewSQL is an interesting new balance between the two, but comes with footnotes that you really should know.

The below won't go into established relational databases. It is assumed you know how to find information about Postgres, MySQL and such.

...though things like Apache Derby (previously IBM Cloudscape) can be interesting if you want to, basically, embed a small (~3.5MB) RDBMS within the same JVM as your Java application (though it can also be set up client-server style). Or, perhaps more likely, something does this for you, to give you standard SQL OLTP without requiring you to install and set up something more complex.

Key-value stores

You ask for the value stored under a given key. Typically there is no structure to the data other than your interpretation after fetching it. This puts them very close to object stores and blob stores (the last basically being file stores without filesystem semantics).

When this fits your data and use, these are often low-bother ways to store a whole bunch of information, often with some corruption recovery.

If you have something in a RDBMS where you are actually mostly retrieving by primary key, and doing few or no joins (which may include things like simple data logging), you could use one of these instead.

Disk-backed or not

Various file-based stores (see e.g. File database notes) are effectively disk-based key-value stores.

Since they are often indexed, and cached in some way, you may easily think of them as hashmaps that happen to be large and stored on disk.

Used more than you may think; Berkeley DB is used in many embedded setups.

There has been some revival in this area. For example, Tokyo Cabinet is basically a modern, faster reimplementation of dbm (with some extensions, e.g. Tokyo Tyrant to make it networked, Tokyo Dystopia to add full-text search).

When not disk-backed, they are effectively in-memory caches (e.g. memcached), and sometimes also useful as message broker (e.g. redis).
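As a minimal sketch of such a disk-backed key-value store, Python's stdlib dbm module (an interface descended from the classic dbm family mentioned above) - keys and values are bytes, and any structure is your own interpretation, here JSON:

```python
# Sketch: the stdlib dbm module as a disk-backed key-value store.
# The store persists across process restarts; the backend picks its own
# on-disk format (gdbm, ndbm, or the pure-Python fallback).
import dbm
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.db')

with dbm.open(path, 'c') as db:            # 'c' = create if needed
    db[b'user:1'] = json.dumps({'name': 'Ada'}).encode()
    db[b'user:2'] = json.dumps({'name': 'Grace'}).encode()

with dbm.open(path, 'r') as db:            # reopened: the data survived on disk
    print(json.loads(db[b'user:1']))       # {'name': 'Ada'}
```

Note how the 'user:' prefix convention is entirely ours - the store itself only sees opaque byte keys, which is much of the point of a kv store.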


Whenever each value is completely independent from all others, you can spread these to roughly any scale you want, by putting them on any number of machines and not even telling the others about it.

Once you want things like altering multiple things atomically, though, you get one of three answers:

  • haha no.
  • not with acceptable performance
  • 'eventual consistency probably' is good enough, right?

Document store

Key-value storage where the value is structured data, often not following a strict schema.

It is also frequently possible to index on fields within these documents.

Often presented as JSON (or XML, though XML databases can be considered a specific type of document store).

In contrast with e.g. relational data, documents are often altered individually, delivered as-is, and not heavily linked.


You can often think of these as key-value stores with larger values and a nicer interface to change them.

In that documents are whole things and mostly independent - you get no relations the database knows/cares about, and any indexing mostly just implies a little overhead to your CRUD.

The difference is that it is easier to expect documents to act as rows, and maybe some transactions.

The answer to transactions is still often one of:

  • haha no
  • not with any speed
  • eventual consistency is good enough, right?
though you can often make all alterations to a single document happen at once
and some will make that 'eventual' sooner than others

Column store

Regular RDBMses store data by rows, which makes sense when rows represent individual things you usually handle one by one.

A column-oriented DBMS basically stores columns at a time, which makes sense when you're typically working on columns - e.g. primarily doing aggregations, summaries, statistics, or analytics on what works out as a column at a time (with less focus on data relations and structure).

It also tends not to need indices for these applications.

There aren't really NoSQL column stores that I know of, but it's good context for wide column stores.

Wide column store

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

A wide column store, sometimes called an extensible record store, could be thought of as a key-value store where you make the key more complex.

For example, if you happen to have tabular data, you might use keys like: (row,col) → data

And yes, for that example you could absolutely abuse a plain key-value store, by smushing row and column into a single key.

But wide column stores tend to add semantics, e.g. ways to work on related keys.

You can see this as a data model pasted on top of basic key-value storage, meaning the storage is easily distributed by nature - unless your data model breaks that.
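That key-smushing, plus the kind of 'work on related keys' semantics a wide column store adds on top, can be sketched like so (a plain dict stands in for the key-value store; the separator choice is an assumption of this sketch):

```python
# Sketch: faking (row, col) -> data on top of a plain key-value store by
# encoding both parts into one key. The separator must never occur in row
# or column names, or keys become ambiguous.
SEP = '\x00'   # a byte unlikely to appear in names

kv = {}        # stand-in for any key-value store

def put(row, col, value):
    kv[row + SEP + col] = value

def get(row, col):
    return kv[row + SEP + col]

def row_items(row):
    """The kind of thing a wide column store adds as real semantics:
    operating on all keys related to one row."""
    prefix = row + SEP
    return {k[len(prefix):]: v for k, v in kv.items() if k.startswith(prefix)}

put('user1', 'name', 'Ada')
put('user1', 'email', 'ada@example.com')
put('user2', 'name', 'Grace')
print(row_items('user1'))   # {'name': 'Ada', 'email': 'ada@example.com'}
```

In a real wide column store the row_items()-style operation is efficient because related keys are stored sorted together; here it is a linear scan, which is exactly what the added semantics let implementations avoid.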

This can be a flexible way of storing a few types of structured data. Consider:

  • modeling a single matrix-like table like (row,col) -> data
if it's very sparse it may save a lot of space
if it's very large you may find the inherent ability to distribute useful
but if the table is small and you usually need all of it, this would just be a very inefficient way of storing it. If the values are very small and/or you usually need a lot of them, it's a particularly inefficient way of accessing it; you have to remember things about the table (like its size) separately, and you can't necessarily fetch ranges very easily
  • modeling EAV-like storage and access (more scalably/naturally than putting it in relational)
(entity, attribute) -> value
  • ...and more.

Real-world wide column stores may well be made for a few specific cases, which makes them more organized or more efficient for a specific use.

For example, google BigTable works something like (row:string, column:string, time:int64) → string, and adds some semantics.

And consider e.g. Cassandra, which tries to be as table-like and SQL-ish as it can manage.

In particular, various of them have the idea of column families (making them more structured than e.g. EAV would suggest).

For example, Bigtable chooses that data under a row key is ordered, and that reads/writes under a single row key are atomic (which helps related data sit on similar nodes, helping access speeds).

  • columns are grouped into column families
which makes them more structured under the covers (and helps access efficiency)

Bigtable suggests you store and access things with keys like:

(row:str, column:str, time) -> data

For another example, cassandra has partition keys and clustering columns, which help direct how data is stored (its primary key is the partition key plus the clustering columns(verify)).

From another view, each row still maps to many columns that could be selected. Part of the access efficiency comes from the fact we need only select the columns we are interested in, and this is part of the lookup, not the filtering.

Wide column stores are, by themselves, just extensible record stores, but various implementations add additional properties

Examples: Bigtable (proprietary), HBase, Cassandra, Cosmos DB (when using its Cassandra API)

Bigtable is a distributed data chunk storage.

While bigtable is built on things only available within google, there are implementations based on the whitepaper, like Cassandra and HBase (on top of HDFS).

It aims to meet a moderately specific set of requirements, which happened to be useful for google search, gmail, and such.

It aims at serving potentially large chunks of only loosely structured data, with low latency where possible/necessary.

It actually is a moderately specific data/access model, with ordering that allows for a B+-tree-like distribution model across various nodes.

The data model is described in the whitepaper as "a sparse multi-dimensional sorted map"

In other words, a map.
With some fancy implementation details.

In terms of uniqueness, the map is from a 3-tuple:

  • row key (string),
  • column key (string), and
  • timestamp (int64)

to a string value:

(row, column, timestamp) -> value

(Note that to Bigtable itself, these values are uninterpreted bytes. It is up to clients to give them meaning, as necessary, e.g. by serializing.)

This is not the way you access Bigtable, though, and most real-world applications of Bigtable are neither quite like a map nor like a row-column table. Both of those can be useful assisting mental models, but you are likely to use it quite differently. Note that columns are weak versions of RDBMS-style columns at best.

(Note that you can use the bigtable setup in more than one way, because there is no predefined or enforced schema, only one implied in use.)

Consider for example the whitepaper's example: storing, for each URL, the page contents and the sites that linked to it (data you might want for some particular linkage experiment or so, and would generate from crawl data).

For example, if foo.com and bar.com have links to example.com, you would have stored items probably at least including the following:

('example.com', 'anchor:foo.com', 0 ) -> 'link name used from foo'
('example.com', 'anchor:bar.com', 0 ) -> 'link name used from bar'
('example.com', 'contents:', 0 ) -> 'whole_html_page'
('foo.com', 'contents:', 0 ) -> 'whole_html_page'
('bar.com', 'contents:', 0 ) -> 'whole_html_page'

In this case, anchor is the column family you are using to map from link URL to link name.

In comparison to an RDBMS, you would have a few options, and only adding an anchors table to the schema is even vaguely sensible (a compounded list in a single column would not scale, and a column per site would be crazy because of the way that is stored).

In comparison to maps, well, you can indeed use a basic map like the above, creating a single key from the elements above.

Except, of course, Bigtable rows and columns have meaning - in terms of querying, management, access control, and such (and also in implementation details, such as compression happening per column family).

Mutations happen atomically at row level. You can open a row as if it were a record, and add/alter/delete columns from it, then apply (store).

Reading can be done in a few ways, but mostly done by column family/families in a row or in any row(verify), optionally with row/column/timestamp filters(verify).
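As a mental model (a toy sketch, not Bigtable's real API - the class and method names here are made up), the 3-tuple map plus row-atomic mutation and versioned reads could look like:

```python
# Toy model of the Bigtable data model: a map from (row, column, timestamp)
# to an uninterpreted value, where all mutations to one row are applied as
# a unit, and reads return the newest version of a cell by default.
class ToyBigtable:
    def __init__(self):
        self._cells = {}   # (row, col) -> {timestamp: value}

    def mutate_row(self, row, puts):
        """Apply several column writes to one row as a unit (row-atomic)."""
        for col, ts, value in puts:
            self._cells.setdefault((row, col), {})[ts] = value

    def read(self, row, col):
        """Return the newest version of one cell."""
        versions = self._cells[(row, col)]
        return versions[max(versions)]

    def read_family(self, row, family):
        """Read all columns in a row belonging to one column family."""
        return {c: self.read(r, c)
                for (r, c) in self._cells
                if r == row and c.startswith(family + ':')}


t = ToyBigtable()
t.mutate_row('example.com', [('contents:', 1, '<html>old</html>'),
                             ('contents:', 2, '<html>new</html>'),
                             ('anchor:foo.com', 1, 'link text')])
print(t.read('example.com', 'contents:'))       # '<html>new</html>'
print(t.read_family('example.com', 'anchor'))   # {'anchor:foo.com': 'link text'}
```

The real thing differs in every implementation detail - ordering, distribution, garbage collection of old versions - but the shape of the interface is roughly this.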


Wide column stores are often used with data that imitates a row-column setup at scale (and often with column sparsity), with the implication that each cell can be formatted differently - though you usually don't want that.

Depending on how you use this, you could store something imitating tables, rows, and columns, but implemented as a variant on a key-value store.

This is

  • unlike classical databases, in that it doesn't store rows as units
  • unlike column stores, in that it also doesn't store columns clustered
though wide-column can be hinted to do that if possible

  • sometimes leads to very compressible storage
  • you may have higher per-cell overhead, but it is also fairly predictable
for extremely large stores, the scalability will often make up for that (for reasonable queries)

  • the column family stuff (hinting and grouping) can make a lot of difference
which is great if you know how to use it
and less great if you don't

  • at smaller scale, it tends to be less efficient
joins and queries involving multiple attributes are often less efficient
writes may be less efficient
Consider e.g. that an update of multiple things in a 'row' may have to go to ten different places, where in a classical database it goes to just one.

So arguably, this is more of an OLAP than OLTP thing.(verify)

Examples: Cassandra, Bigtable, Spanner, HBase

Hybrid columnar



Graph database

A graph in the mathematical sense, of a node-link structure.

Certain types of data are very handy to see this way - sometimes because it is a natural fit, sometimes because the data is structured but still free-form (so you don't end up with 100 relation tables and a headache of joins, and it can be much faster when you typically look up only very small parts of that data).

Search engine

A search engine is generally considered an index built from a primary data store elsewhere - but that also makes the searchable index the thing you actually use.

Plus there are projects that do both.

Message brokers

Not a storage model, or NoSQL; they just turn up in the same places, because they are useful tools when sending messages with more than two fixed endpoints involved (be it processes, hosts, or datacenters).

See also Programming_notes/Communicated_state_and_calls#Message_brokers.2Fqueues.3B_job_management

Data consistency and versioning, and its concepts


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

ACID (Atomic, Consistent, Isolated, Durable) is a set of properties that, together, aim to guarantee that the data a system shows you is correct, consistent, and up to date at all times. 'ACID compliant' can be intuited as 'a guarantee that my data is safe and correct'.

For a large part, that is about how a sequence of interactions is handled, in particular:

  • dealing with sequences of statements that must be applied all together, or not at all: transactions
  • parallel execution
because most real practice has more than one user
and because it makes most parts a lot harder (arguably, there should be a P in the acronym, because parallelism is usually the part that introduces the most complexity, edge cases, and tradeoffs, and is sort of important to understand well)

The parts:

  • Durability
roughly: when the backend acknowledges a request, this change will survive, and you'll get it back as-is, even around various typical malfunctions
in part a 'guaranteed to be actually written to disk', in part about recovery, in part about ensuring other ACID parts during recovery
which implies implementation details - e.g. "write a new copy, then retire the old one" style journaling (see also e.g. WAL) is a popular solution, instead of "update/overwrite the existing data as you go", because while overwriting can often be a little faster, failure halfway through might make it unrecoverable
  • Consistency
roughly: data respects all imposed integrity rules (constraints, cascades, triggers, etc.) at all times
attempting an operation that isn't allowed will be rejected, and will never lead to showing invalid state to clients at any time
if the transaction turns out to violate a constraint by one of its later actions, it must be rolled back
this is basically impossible to do without atomicity and/or isolation
  • Atomicity
roughly: a set of changes either happens fully, or not at all
since many operations in a transaction aren't even individually atomic, this certainly requires extra bookkeeping (e.g. of before and after states)
the easiest method is locking, but this also puts a very hard limit on concurrency. Concurrent methods are more involved.
  • Isolation
transactions should not interfere with each other when concurrent
ongoing interactions are invisible to each other; even when actually handled in parallel, they should look to any client as if they were done sequentially; what one transaction alters is hidden from all others
largely means they cannot react to new state from another transaction
required for consistency
this conflicts with concurrency (more fundamentally than some think - varying somewhat with how complex the data structures are)
(at the very least, things become more complex when the data they deal with overlaps in any way)
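Atomicity and consistency can be made concrete with SQLite (which is ACID for local use): a transaction that would violate a constraint rolls back as a whole, so no partial changes ever become visible.

```python
# Demonstrating atomicity + consistency: a transfer that violates a CHECK
# constraint is rolled back entirely, leaving neither half applied.
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, '
           'balance INTEGER CHECK (balance >= 0))')
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
db.commit()

try:
    with db:  # a transaction: commits on success, rolls back on exception
        db.execute("UPDATE accounts SET balance = balance - 150 WHERE name='alice'")
        db.execute("UPDATE accounts SET balance = balance + 150 WHERE name='bob'")
except sqlite3.IntegrityError:
    pass  # alice can't go negative, so the whole transfer is undone

print(db.execute('SELECT name, balance FROM accounts ORDER BY name').fetchall())
# [('alice', 100), ('bob', 0)]  -- neither half of the transfer applied
```

Remove the transaction (run the two UPDATEs with commits in between) and a failure between them would leave money created or destroyed - which is exactly the bookkeeping ACID exists to prevent.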

The original explanation of ACID remains a fairly open-ended description - not a very strict set of constraints, and certainly not rules for how to implement it.

If some of these descriptions seem to overlap, that's because they do.

Which leads to arguments like "is what isolation says already covered by atomicity?" - so it's probably there because we planned to relax it from day two.

In fact, isolation seems to have been added later indeed:

  • ACD comes from J. Gray (1981), "The Transaction Concept: Virtues and Limitations"
  • ACID comes from T. Haerder and A. Reuter (1983), "Principles of transaction-oriented database recovery"

One way you can see the A/I difference is that

  • atomicity is about all of a transaction's writes becoming visible atomically as table state
  • isolation is about different reads in another transaction not seeing different versions at different times within it
  • one focuses more on correctness, the other more on ordering

Consider: you could write a system that switches in one transaction's new state perfectly atomically, and still have another transaction read from the earlier state at one time, and the new state later (nonrepeatable reads, see below).

Such details are also one reason SQL defines isolation levels, which actually relax strict ACIDity.

So yes, real-world relational databases are typically not configured to be fully ACID (and a few can't even be).

This is also roughly why a lot of NoSQL often does not care about strict transactional correctness.

ACIDity of RDBMSes versus NoSQL

ACID is historically largely related to RDBMSes.

While ACID is just a set of principles that any system could follow, many RDBMSes are designed with the idea that without considering ACID in your design, you'll make some sort of mess.

NoSQL is somewhat different.

Most NoSQL simplifies the data model it can store (and usually also removes transactions) so that it needs fewer guarantees for it to work.

This makes ACID almost irrelevant. If you take ACID to mean "operations probably not incorrect", you mostly have that:

  • Atomicity may be simpler - particularly when you are restricted to 'get' and 'put' of single items
and it is generally not guaranteed on bulk operations as a whole; they'll say something like 'best effort'
  • Isolation is often irrelevant (no transactions)
  • Consistency has almost no rules to adhere to (beyond rejecting nonsense requests), so is fine
  • (Durability is a separate conversation)

Sure, from the context of relational, NoSQL breaks relational, but that was almost the point.

It can be useful to dig a little deeper.


  • ACIDity of single-row (or single-key) alterations, versus
  • ACIDity of multi-row (or multi-key) alterations.

An RDBMS engine needs to give multi-row ACIDity to be taken remotely seriously.

Most NoSQL isn't multi-row ACID because there are no multi-row alterations in the first place. You often don't get to group alterations in transactions that get applied atomically.

Are they single-row ACID? Usually, easily.

The issue comes in when they call themselves ACID because they're single-row ACID, but they also happen to allow multi-row updates.

If you do that from your code, the result is not ACID overall. You don't get that guarantee.

And we would really like multi-row updates. Particularly around denormalization, it would be great if we can say "Update this related set of denormalized calculated fields, either all or none, because that means the app won't show something inconsistent". By and large, we cannot. We are running either on 'hope for the best', or something inbetween because some BASE implementations are pretty neat.
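A sketch of why that matters (toy code, with a simulated failure): a store that is atomic per key but not across keys can leave denormalized copies out of sync, and readers see the mix.

```python
# Sketch: a store that is atomic per key but not across keys. A failure
# partway through a multi-key update leaves a mix of old and new state -
# exactly what multi-row ACIDity would have prevented.
kv = {'article:1:title': 'Old title',
      'index:titles': ['Old title']}   # a denormalized copy of the title

def flaky_put(key, value, fail_after):
    """Each individual put is atomic, but the 'node' dies after a while."""
    if flaky_put.calls >= fail_after:
        raise ConnectionError('node went away')   # simulated mid-update failure
    flaky_put.calls += 1
    kv[key] = value

flaky_put.calls = 0
try:
    flaky_put('article:1:title', 'New title', fail_after=1)   # succeeds
    flaky_put('index:titles', ['New title'], fail_after=1)    # fails
except ConnectionError:
    pass   # in real code: retry? cleanup job? hope?

print(kv['article:1:title'])   # 'New title'
print(kv['index:titles'])      # ['Old title']  -- now inconsistent
```

BASE-style systems answer this with repair-after-the-fact (read repair, anti-entropy, background reconciliation) rather than preventing the inconsistent window in the first place.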

But technically we weren't lying.

Also, good luck knowing for sure.

The D is sort of a separate case.

Durability doesn't mention time.

In RDBMSes, it often means 'if I acknowledge, it's guaranteed to actually be written to disk'. Be it tablespace or recovery log, you'll see this data back unless something goes unusually wrong.

NoSQL is split between

  • Write-Ahead Logging like that, and
  • "sure, we save our in-memory state to disk every few minutes"
  • (and a few "nah, we're in-memory only", which is fair)

And in a lot of use, it's a fine tradeoff. Just know that you're making it, and that this might be a bit of a stretch of D.

Aside from NewSQL, there are few NoSQL cases that try to give you multi-row ACIDity (or detail the cases). Among them:

  • VoltDB [1]
  • FoundationDB [2]

You can ask "Is S3 ACID?" and get completely different answers, including

  • basically yes, the individual operations are very predictable
It only does get, put, list, and delete.
get and put of completely independent items barely gets into any consistency.
list may not reflect the objects present a second later, but you would only expect that if there were transactions, which there are not.
  • No. It's not a database, it's an object store. Your question doesn't make sense.
  • No. It does not give read-after-write consistency[3]
(this changed in 2021, basically after 15 years, and still seems slightly qualified)
  • Maybe; you could add something like transactions on top,
but it'll probably work out as batches and fallback cleanup. Close enough, right?
You still don't get multiple such interactions working in a serialized way. No I or A, really.
But hey, you generally wouldn't expect to.

Also, if you use S3 as a backend to a metadata store elsewhere - which is common and convenient - that's even less likely to stay consistent with each other over time.

Isolation levels

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

ACID is great in theory, but it turns out that parallelism is only easily faster under some conditions, and in various real-world ones, things can still come down to "do things largely in series, or it becomes way too complex to handle".

So another question pops up: Do you want a tradeoff between which guarantees we make, for extra speed?

Isolation levels are a controlled way to loosen strictly ACID behaviour in RDBMSes.

Roughly speaking, isolation levels control what sorts of non-ACID behaviour is allowed whenever you don't do explicit, fairly complete locking.(verify)

Three things you want to avoid (at least, the most mentioned and relevant):

  • Dirty reads - refers to fetches that ignore any transaction or locking
Means we don't have to wait at all - great for speed!
May e.g. mean one connection may read data that is not yet committed by another
when multiple clients are writing relational data that has to make coherent sense, you can assume this will go wrong at some point
  • Nonrepeatable reads: getting different values for the same row when trying to read it multiple times in the same transaction
...e.g. because of another transaction's committed alterations. Reads can be non-dirty, but still non-repeatable.
The nonrepeatable reads we usually care about are those within the same transaction. There is also the concept of "nonrepeatable within the same query". The fact that we use the same word for both is confusing, yes.
  • Phantom behavior / phantom reads refers to transactions 'magically' seeing new rows
i.e. seeing that the result set changes (rather than that specifically identifiable row values change)
e.g. caused by WHERE clauses that select from another table at different times, which can find new matching rows after they are committed by something else.

Approximately speaking,

dirty reads are reading non-committed data from another transaction
nonrepeatable reads are often because another transaction COMMITted an UPDATE
phantoms are often because another transaction COMMITted an INSERT or DELETE)

'Approximately', because with more scrutiny there are a few more specific cases.
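The dirty-read / committed-read distinction can be made concrete with a toy model (not a real database - just a committed state plus staged, per-transaction writes):

```python
# Toy model: a store with staged (uncommitted) writes per transaction.
# A dirty read is willing to look at staged data; a committed read is not.
committed = {'x': 1}
staged = {}              # txn_id -> {key: value}, not yet committed

def write(txn, key, value):
    staged.setdefault(txn, {})[key] = value

def commit(txn):
    committed.update(staged.pop(txn, {}))

def read(key, dirty=False):
    if dirty:   # READ UNCOMMITTED-style: take anyone's staged value if present
        for changes in staged.values():
            if key in changes:
                return changes[key]
    return committed[key]

write('t1', 'x', 2)            # t1 has written, but not committed
print(read('x'))               # 1  -- a committed read still sees the old value
print(read('x', dirty=True))   # 2  -- a dirty read sees uncommitted data
commit('t1')
print(read('x'))               # 2
```

If t1 had rolled back instead of committing, the dirty reader would have acted on a value that never existed - which is the risk the stricter levels below exist to remove.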

You don't have to think about those terms directly, because SQL-92 defines isolation levels:

    • Basically 'anything you find will do', does not use/observe read locks
    • allows dirty read
    • allows nonrepeatable reads
    • allows phantom behaviour
    • Won't read uncommitted data, doesn't ensure that read rows won't change within the length of this transaction
    • avoids dirty reads (only)
    • allows nonrepeatable reads
    • allows phantom behaviour
"Won't read uncomitted data" and "allows nonrepeatable reads" seem to conflict. The point lies in the difference between a query and a transaction. Nonrepeatable reads in a query means it will not see different values if it needs to read a row twice. But two identical queries in the same transaction can still see data if another transaction has completely committed data to the relevant table. If you want all state to be repeatable within your transaction, you want either locks, or the following:
REPEATABLE READ
    • Won't read uncommitted data, and does ensure read rows will not change within the transaction. Put another way, you're essentially reading from the same snapshot each time.
    • avoids dirty reads
    • avoids nonrepeatable reads (the main point, since it's the name of the thing)
    • still allows phantom behaviour (...wider locking may avoid it, without being quite the same as SERIALIZABLE. Depends.)
    • usually implemented by placing locks on all data relevant to the query
SERIALIZABLE
    • strictest -- no transaction interaction possible at all
    • Tends to be noticeably slower, because it is often implemented with zealous locking, or sometimes even by never handling two transactions on the same object(s) at once(verify).
    • avoids dirty reads
    • avoids nonrepeatable reads
    • avoids phantom behaviour


  • Some database engines may offer their own additional levels
e.g. SNAPSHOT[4]
  • oracle and postgres say serializable when they mean snapshot(verify)
  • most of these isolations aren't ACID - that's sort of the point.
  • Most database engines can be asked to behave to different levels, e.g.
MyISAM doesn't support transactions at all, so is probably best described as READ UNCOMMITTED always
InnoDB supports all four(verify), and defaults to REPEATABLE READ
PostgreSQL's MVCC nature means it internally has the equivalents of READ COMMITTED and SERIALIZABLE(verify).
Default is READ COMMITTED(verify)
The other two don't exist, and asking for them actually gives you the next-strictest behaviour.
Note that given its MVCC design, reads tend to spend less time waiting for locks, so looser settings often aren't as useful

DB2 uses its own terms and makes a few more distinctions, which may be interesting reading:

  • Repeatable Read (RR)
  • Read Stability (RS)
  • Cursor Stability (CS)
  • Uncommitted Read (UR)
  • No commit (NC)

See also:

Considerations to allowing dirty reads
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Setting the transaction isolation level to READ UNCOMMITTED (or using e.g. MSSQL's per-query WITH(NOLOCK)) allows dirty reads, which means the value is read out without locking considerations.

This can be useful when any value currently in there will do. This may be true in the case of social-site profile information, continually updated statistics tables, and such. Any value from now to a few seconds ago will be perfectly good - using locks and listening to transactional accuracy will only really mean more waiting.

Of course, dirty reads may at any time return uncommitted, old, or duplicate data (because you are asking the database to violate ACIDity for the read), meaning you should only do this when you know you'll do no harm, such as when you only read the value to show it and no write/update will ever be based on it. (Much the same goes for non-repeatable reads, really)

Note that MVCC-style engines, such as postgresql and MySQL's InnoDB, avoid a lot of classical style locking by isolating data versions, which means there is little speed increase in allowing requests to give you dirty versions.

See also

Serializability, Linearizability


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

There are various formalizing takes on what happens when systems become scaled up and distributed, in terms of guarantees, and also things like bandwidth and job rate and the things that hold them back, and how such details interrelate.

One of the better known ones (which, spoilers, deals with fewer guarantees than you really want, or probably think) is Brewer's CAP theorem[5], which points out that of the three properties it names, you can never fully guarantee all at once - no matter your answer, you are always choosing a tradeoff.

CAP refers to:

  • Consistency usually means "do all nodes see the same data at the same time",
Consistency in CAP is largely about ordering and timing
(CAP consistency mostly describes linearizability. This has little to do with ACID's consistency, which is more about "does it follow all the rules and constraints imposed (typing, relations, bad values, duplicates, etc.)")
partly "once new state is successfully accepted, we never send older state back"
partly agreeing which version is current
partly "we need to put things in the right order to interpret them properly"(verify)
  • Availability - you will get a response with reasonable time
a response that is data and not an error
but there is no guarantee that it is the most recent data
(note: an answer "regardless of things like node failure" implies that node failure does not halt the system)
  • Partition tolerance - system continues even with imperfect communication
no, 'partition tolerance' is not the best name for "what if messages go wrong?"
partitioning refers to "parts of a system that don't agree on state"
dealing with issues like
"some message got lost"
"some message arrived out of order"(verify)
"one host fell over due to hardware and then came back" (CAP doesn't mention this, or other real-world faults, but if you squint you can see it as having missed a bunch of messages)
and often means the partitioned system needs to be resolved to become one whole again somehow.
...automatically if at all possible

When something partitions, we would like to qualify its behaviour, and CAP names are arguably quite useful to describe their tendencies in terms of the above properties.

But this can easily be misleading -- it makes it really tempting to equate the tendency with the more technical, stricter meaning (incorrectly, because any real-world system's behaviour is due to its actual workings, and it will fail in ways you then do not expect).

Some potentially misleading things

  • CAP is actually a much narrower thing than people think - technically it models a single piece of memory.
    • it says nothing about latency, which everyone in NoSQL absolutely does care about.
    • it doesn't even talk about things that touch multiple objects
let alone a system as a whole
  • These three are entangled to start with, more so when you add "...within reasonable time" -- which CAP ignores, and you cannot.
The larger the scale and/or the lower the required latency, the more you lean towards accepting you trade away one of these things, in imperfect conditions or at all.

  • The way we usually explain these to each other, C and A are about properties that might stick around under perfect conditions, P is specifically about reasonable expectations under imperfect conditions - partitioning.
which leads to this "trying to strongly guarantee two of these comes at the cost of the ability to guarantee the third so well" view
Some introductions suggest it's a "pick any combination of two". It's really not.
It doesn't become exactly two either. In terms of misbehaviour that you may want, all are a matter of degree.
  • The partitioning may be an exceptional thing, or may be a regular thing
a system may try and manage to be CAP almost always. If it never partitions, there is no good reason to give up CA in regular operation.

  • CAP only really considers the fault we call partitioning - and ignores other kinds of faults or how to deal with them (kind of a huge thing to leave up to implementer)
  • The way Availability is technically defined in CAP is actually not met by a lot of real world systems.
  • The way Consistency is defined is not the same as ACID's Consistency.

The tendencies mentioned above include:

  • focus on C+A tends to be for consistency management, lock management, fastish transactioning
usually doesn't like partitioning - many barely handle re-syncing after a partition.(verify)
  • focus on C+P will under regular circumstances also provide Availability, but when partitioned does not provide Availability
but often basically says "pause until a partition stops happening, e.g. going read-only or unavailable"
so that it stays deterministic with a simpler model, and more easily rejoinable
there has been research to make that more flexible
can be useful for large things that act as a primary data store(verify)
  • focus on A+P is probably for speed at scale
under regular circumstances can provide C (or close enough)
during a partition, this will often let parts operate, with less care that nodes' data might go out of sync
to clients fetching things from different parts, it will be inconsistent
whether inconsistencies can be resolved, or whether this is just a hope, depends on a lot of details

See also


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

BASE is a wink at ACID.

BAsically available - roughly meaning there will be a fairly quick response to all requests -- because that response is allowed to be

known-current state
recent-but-old state (so, when data interrelates, logically inconsistent state).
'try again later'

Soft state - you should assume the system might be catching up, even while there is no input.

Eventual consistency - The system will tend towards well-propagated consistency once it stops receiving input.

Eventual consistency

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

For context, RDBMSes do what you could now call immediate consistency, in the sense of "if I acknowledge an alteration, it's there in my store". Though this does not necessarily imply linearizability in the sense of "another operation will pick it up as soon as technically possible", it's usually very soon after, and well defined how soon after and why.

Eventual consistency usually means "I acknowledge I've received that -- I'll get to actually processing it soon."

Which may be very soon after indeed - it may propagate consistency whenever it's not busy receiving input.

In some cases, it is not very important how much later a thing is processed. In these cases, this makes it easier to speed things up and/or scale things up.

Yet even if we don't care about when, we may still care about what happens when we hit an issue.

If we hit a conflict (in the data sense), and we do not have a reasonable way to resolve a particular conflict, the only sensible decision may be to forget the data.

And if the acknowledgement meant "I stored it in RAM" rather than journaled it to disk, it might be lost in a crash.

The benefit is that we can say yes earlier, not holding up this or other clients. The cost is fewer guarantees.

This might be a very reasonable cost/benefit tradeoff e.g. in things used primarily as caches, in that you usually don't have any issue, and it's perfectly sensible if updates appear slightly later.

It is less ideal in primary stores, in that while it means data is usually available in the short run, it may only be consistent (enough to not give weird results) in some longer run.
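The acknowledge-now, propagate-later behaviour can be sketched with a toy model (hypothetical Python, not modeled on any particular product's API): a write is acknowledged by one replica immediately, and other replicas only catch up when a sync step runs.

```python
import collections

class EventuallyConsistentStore:
    """Toy model: writes are acknowledged immediately by one replica
    and propagated to the others only when sync() runs."""
    def __init__(self, n=2):
        self.replicas = [dict() for _ in range(n)]
        self.pending = collections.deque()

    def write(self, key, value, replica=0):
        self.replicas[replica][key] = value   # acknowledge right away
        self.pending.append((key, value))     # replicate later

    def read(self, key, replica):
        return self.replicas[replica].get(key)  # may be stale

    def sync(self):
        while self.pending:
            key, value = self.pending.popleft()
            for r in self.replicas:
                r[key] = value

s = EventuallyConsistentStore()
s.write("x", 1, replica=0)
print(s.read("x", replica=1))  # None - replica 1 hasn't seen the write yet
s.sync()
print(s.read("x", replica=1))  # 1 - consistent once propagation catches up
```

Reads before the sync step illustrate the "recent-but-old state" that BASE explicitly allows.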

See also:

Broad qualifiers, a.k.a. thinking about what you want from your database



  • OLTP basically means systems with immediacy, OLAP often means bulk/batch processing
  • Systems can be made to do both to some degree, but OLTP often comes with people wanting it to be as immediate as possible
  • RDBMSes are mostly used for OLTP, NoSQL has less opinion about how it's used

You'll see these two as qualifiers, as kinds of setups, and sometimes as specializations of specific kinds of systems/databases.

Online transaction processing (OLTP) refers to any system that does day-to-day transactions

  • handles individual requests as they come in
  • as safely/serially as configured
  • as realtime as that still allows

This usually puts focus on

  • operations being simple and fast
because if things take longer than milliseconds, it's going to scale very poorly, particularly if done serially
  • consistency within systems - strongly where this is simple
  • consistency between systems - where applicable and to the degree this is possible

OLTP that uses more complex data models tends to be implemented in RDBMSes, because the consistency you then often want can be done for you. When you want scalability you often try to loosen that as much as you can get away with.

Online analytical processing (OLAP) tends to refer to "analyse a bulk of data" setups.

This often means that, mostly unlike OLTP

working on chunks of data collected over a larger amount of time
often done for business reasons, rather than direct use by users.
classically was more about generating reports, analytics, etc. to be looked at later - there are few cases where a response coming seconds later is any problem (or days, if we're talking about reports)
now more frequently also to be looked at live (scale of seconds), and is starting to overlap with more complex variants of health monitoring
(compare each to OLTP, where responses are measured in milliseconds)
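To make the contrast concrete, here is a minimal sketch (using Python's stdlib sqlite3 purely as illustration; the table and data are made up): an OLTP-style operation is a small keyed lookup, an OLAP-style one scans and aggregates the bulk of the data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [("alice", 10.0), ("bob", 20.0), ("alice", 5.0)])

# OLTP-style: one small, keyed, latency-sensitive operation
row = con.execute("SELECT amount FROM orders WHERE id = ?", (2,)).fetchone()

# OLAP-style: scan and aggregate over the whole table
report = con.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(row, report)  # (20.0,) [('alice', 15.0), ('bob', 20.0)]
```

The point is the shape of the work, not the engine: the first query touches one row by index, the second reads everything.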

On mixing the two

When doing OLAP, you could just send batch-like queries to your one system that does everything, but when that system is a critical, live OLTP system, the bulk nature of OLAP style work could slow down OLTP due to its "let's be careful about reading when something is writing" nature.

As such, there is a habit of siphoning off data from a database used for OLTP to a separate database used for OLAP (with its own processing host).

There is now also Hybrid transactional/analytical processing (HTAP), the attempt to do both transactional and analytical work in one system, which sometimes amounts to an automated variant of doing just that.

The more at-your-leisure your OLAP work, the more that classic ETL style processing covers your needs, but these days, there are OLAP systems that aim to do real-ish time monitoring, and more explicitly balance scale and speed (also with NoSQL-ish scaling).

It's often still factors slower than OLTP, or at least a few seconds behind (e.g. when running off a replicated database server),
yet still many factors faster than generic warehousing ("throw an ETL script at it once every so often", "produce a report" style things).

OLAP might be based on structured data from a central OLTP system you have, or it might ingest less-structured data (like logs) in which case terms like 'warehouse' or 'lake' turn up.

As a broad qualifier, OLTP and OLAP can describe

  • a system as a whole - goals, database, code, the hosts used
  • a type of database - e.g.
RDBMS is a fairly classical case of OLTP
most NoSQL (e.g. kv and document stores) are still largely on the OLTP side (if only because few do much processing themselves)
e.g. ElasticSearch used for log analysis or monitoring (see e.g. logstash, kibana) is more on the OLAP side
  • specific queries
e.g. some complex queries may be OLAP
queries may get so complex that people do some pre-structuring

It also has some applications to architecture. OLTP often implies tiered applications, whereas OLAP modeling is much more varied, often somewhat custom (though OLAP NoSQL does exist), from monolithic ETL batch script to complex interdependent streaming setups.

Lakes, warehouses, swamps, pits, and graveyards

Data warehouse means data in a processed, probably actively useful form.

This term has been around since the seventies.

Data lake means little more than "not a data warehouse"

It often means "we're tracking things, we'll see how it's useful later"

This frequently also means storing it in raw form, again because you don't know what you want to extract from it. (...though if the thing is by nature already a structured document, this is more about whether it's indexed than whether it's parsed)

In some cases, data lakes involve so little thinking ahead that

  • you don't know what you have
  • you will have a really tough time figuring how to use or combine or even find this data later
  • you created what you could creatively call
data swamps (inaccessible)
data pits (won't produce clear insights)
data graveyards (you're not using it, and you never will)


  • searching for 'Data lake' will mostly give you IoT related results -- mostly because IoT is very ill defined beyond "collect all the data (personal or not) and see what we can do with it later".

  • Data lake sometimes also includes the software that accesses the data - anything from ETL/ELT, and sometimes much fancier analytics.
more so if this is a company trying to sell you analytics software

  • In practice, though,
    • data warehouse and data lake are often used interchangeably.
    • (so) when data warehousing can mean both, then ETL and ELT are both seen in data warehousing


ETL (Extract, Transform, Load)

is what data warehousing historically implies:
"I clean up my data before entering it into the system" - in terms of data types and data validity, and changing the shape of the data to fit a model/schema you have already settled on at design stage
...often so that it can be stored in a structured database, meaning
you have specific goals in mind
so have already thought about efficient querying, and ensured faster loading
and you need to define that database up front
you can design it in a way that likely queries will be fast

ELT (Extract, Load, Transform) mostly means

"just store what we got, in the form we got it in (probably a text blob, or even a binary blob), and worry about interpreting it later, maybe never"
Data lakes are arguably defined by this "it is not yet structured, queryable data" ELT approach (contrasted to data warehouse, which is more ETL/usable)

This is in part pragmatism:

  • if there's data we don't use yet, and some we may never use, why spend time creating a model and polishing data before we can even store it - we can do that work when we need it, if we ever do
  • it doesn't need to have the form that the original design assumptions require
  • we can capture any and all data, and might be able to ask more varied questions, and get more answers, if we keep the original data around.

However, the same thing that leads to these upsides (you didn't do anything yet) leads to downsides:

  • you will parse and validate it on use
...either on every use
this tends to be true for ad-hoc querying, which tends to mean mapreduce / pipeline solutions
this is usually also slower and/or consumes more power
...or on first use, e.g. if you validate-and-store into a working set for every question you work on
meaning you store even more data, and take more time (sort of because you're creating a warehouse for each use)
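A minimal sketch of the difference (hypothetical Python with made-up data): ETL parses and types records before storage, so queries run against clean rows; ELT stores the raw text and pays the parsing cost at query time.

```python
import json

# raw input as received, e.g. one JSON document per log line
raw_events = ['{"user": "alice", "ms": "120"}',
              '{"user": "bob", "ms": "95"}']

# ETL: transform before load - storage holds typed, validated rows
etl_store = []
for line in raw_events:
    rec = json.loads(line)
    etl_store.append({"user": rec["user"], "ms": int(rec["ms"])})  # typed now

# ELT: load as-is - parse on every use
elt_store = list(raw_events)
total = sum(int(json.loads(line)["ms"]) for line in elt_store)
print(total)  # 215
```

The ETL store answers the same question without re-parsing, at the cost of having decided the schema (user, ms) up front.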

"NoSQL versus relational"

Lookups, joins, flatness, and locality

Responsibilities in database versus responsibilities in code

Why denormalize, why not denormalize?

ORMs considered good and bad

On consistency

The n+1 problem

Some such databases, by type

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

File databases

Not a model but a practicality:

File databases are useful to persist and store moderately structured data on disk,

often storing many items in one (or a few files)
usually accessible via a single, simple library - meaning they are also considered an embedded database

There is typically also some way to alter that data in terms of its items (rather than, as in serialization, the whole thing at once), which may be limited, or just not very performant.

The goal is often a mix of

  • persist a moderate set of possibly-somewhat-complex data without having to
do the low-level storage
think about efficient lookup so much
(demand of users to) install and configure a database engine
think about such a database's networking, auth (because there is none)
  • (because it need only) support only a single program/client
  • relatively read-heavy
  • avoids requiring an external process
that has to be running and someone has to start and be responsible for

key-value file databases

The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (string to string) in a file (sometimes two, splitting out some metadata/index(verify)).

These may be understood as an on-disk variant of a hashmap, allowing rebucketing to have the hashing scale well, and fixed-size buckets to allow relatively efficient modification.

They are also libraries, so easy to embed and avoid network serving details.
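For illustration, Python's stdlib dbm module exposes exactly this kind of store (the file path below is made up); keys and values are byte strings, and usage looks much like a persistent dict:

```python
import dbm, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "settings")

# 'c' creates the database file(s) if needed
with dbm.open(path, "c") as db:
    db[b"theme"] = b"dark"   # store
    db[b"lang"] = b"en"

# reopen read-only; data persisted to disk
with dbm.open(path, "r") as db:
    print(db[b"theme"].decode())  # dark
    print(sorted(db.keys()))      # [b'lang', b'theme']
```

Which backend you get (gdbm, ndbm, or the pure-Python fallback) depends on what is installed; the interface is the same.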

dbm family

Within the dbm family, the ones that are interesting enough to use are probably roughly:

  • dbm, 'database manager' (1979, AT&T)
  • ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
    • sdbm, a 1987 clone of ndbm, made for licensing reasons
  • gdbm, 'GNU database manager', also added arbitrary-length data
  • Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.

There are also continuations / successors of this idea, including

  • Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea
  • tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but safely allows concurrent writers [6]
  • tdbm: variant of ndbm with atomic transactions, in-memory databases
not to be confused with the ORM called "the database machine" and also abbreviating to tdbm
  • MDBM: memory-mapped key-value database store, derived from the same lineage (like sdbm and ndbm) [7]

Berkeley DB notes

Berkeley DB, also known as BDB and libdb, is basically a key-value map in a file.

It is a library instead of a server, so can be embedded, and is used like that quite a bit.

For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.

Technical notes

There is a low-level interface, that does not support concurrent write access from multiple processes.

It also has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them, and then may want to specify whether it's to be safe between threads and/or processes, and other such details.

From different processes, you would probably use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options - see things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.

Interesting aspects/features:

  • it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
  • caching in shared memory
  • option for mmapped read-only access (without the cache)
  • option to keep database in memory rather than on disk
  • concurrent access:
    • writeahead logging or MVCC
    • locking (fairly fine-grained)
    • transactions (ACID), and recovery
  • hot backups
  • Distribution:
    • replication
    • commits to multiple stores (XA interface), (since 2.5)

Both key and value are byte arrays; the application has to decide how it wishes to format and use data.

Both key and value can be up to 2^32 bytes (4GB, though for keys that's usually not a great idea).
A database file can be up to 2^48 bytes (256TB, which is more than various current filesystem limits).

It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.
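Since BDB's Python bindings are not in the standard library, here is a sketch using stdlib dbm as a stand-in (it has the same opaque-bytes contract): the application picks the encoding, here packing two floats into the value with struct.

```python
import dbm, os, struct, tempfile

# BDB-style stores treat keys and values as opaque byte strings;
# the application defines the encoding (stdlib dbm used as stand-in)
path = os.path.join(tempfile.mkdtemp(), "points")

with dbm.open(path, "c") as db:
    # pack two little-endian float64s into a 16-byte value
    db[b"point:1"] = struct.pack("<dd", 1.5, -2.25)

with dbm.open(path, "r") as db:
    x, y = struct.unpack("<dd", db[b"point:1"])
print(x, y)  # 1.5 -2.25
```

Anything from struct-packed records to JSON or serialized objects works; the store itself never interprets the bytes.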

Format/access types

There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference is mostly in performance, and only matters when the data is large, since if all data fits in the cache, this is a near-non-issue.

For larger data sets you should consider how each type fits the way you access your data.

If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).

You can supply your own comparison and hash functions.

More details:

  • Hash (DB_HASH)
    • uses extended linear hashing; scales well and keeps minimal metadata
    • supports insert and delete by key equality
    • allows iteration, but in arbitrary order
  • B+tree (DB_BTREE)
    • ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
    • allows lookup by range
    • also keeps record numbers and allows access by them, but note that these change with changes in the tree, so are mostly useful for use by recno:
  • recno (DB_RECNO)
    • ordered records
    • fast sequential access
    • also with key-based random access - it is actually built on B+tree but generates keys internally
  • queue (DB_QUEUE)
    • fixed-size records
    • fast sequential access

You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.

There are provisions to join databases on keys.

Versions and licenses
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
this is probably incorrect in parts
it's hard to find good details

Early versions came from a library for simple hash, and later b-tree records.

  • BDB 1.85 was a version for BSD 4.4, released in 1992 under a BSD-like license (that makes it behave like the GPL)

A Netscape request for further features led to the creation of Sleepycat Software in 1996

  • BDB 2.1 was released in 1997, adding concurrent access, transactions

Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had few licensing consequences for versions since then (verify)

Versions 2 and later, including the Oracle additions, are dual-licensed, under either:

  • The Sleepycat Commercial License: A purchased license that does not force redistribution
  • The Sleepycat Public License: software that uses Berkeley DB must be distributed (note: excludes in-house development)
    • apparently, the case of language support/wrapping is a special one (not unlike LGPL) as in cases like python and perl interfaces, the software that uses BDB is the language, not any scripts that use that interface/BDB. This does not seem to apply in the case of C library use(verify).

In other words, you generally have three options:

  • use freely while the application is and stays entirely personal / internal to your business
  • you have to distribute the source of the application that uses BDB
  • you have to get a paid license

There are now three products:

  • Berkeley DB (the basic C library)
  • Berkeley DB Java, a Java version with options for object persistence
  • Berkeley DB XML, an XML database using XQuery/XPath (see e.g. [9])

Some additional licenses apply to the latter.

See also

Tokyo Cabinet / Kyoto Cabinet

database library, so meant for single-process use.

Tokyo Cabinet (2007) (written in C) is an embedded key-value database, a successor to QDBM

  • on-disk B+ trees, hash tables, or fixed-length array
  • multithreaded
  • some transaction support
  • no real concurrent use
process-safe via exclusion control (via file locking), but only one writer can be connected at a time
threadsafe (meaning what exactly?)

Kyoto Cabinet (2009) is intended to be a successor.

written in C++, the code is simpler than Tokyo, intends to work better around threads. (Single-thread seems a little slower than Tokyo)


  • Tokyo may be a little faster
  • Tokyo may be a little more stable (at least in earlier days of Kyoto's development)
  • Kyoto may be simpler to use
  • Kyoto may be simpler to install

LightningDB / LMDB

Lightning Memory-Mapped Database, a.k.a. LightningDB (and MDB before a rename)

  • Ordered-map store
  • ACID via MVCC
  • concurrency


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
  • cdb (Constant DataBase), basically a fast on-disk associative array [10] [11]

Also: some new things

e.g. LevelDB and RocksDB, on this page filed under kv store

Relational file databases

SQLite notes

SQLite is a library-based database engine.

No server, no admin rights required to install or start it, a database it works on is a single file (with some helper files that appear temporarily but you don't have to think about), and there is no configuration (well, per-file database configuration).

It's not geared to be multi-client, but with some care you can get that to some degree.

It's a little fancier than similar library-only databases, in that it e.g. supports

  • most of SQL92
  • indices
  • ACIDity
  • recovery via journaling
  • constraints
  • views (to a degree)
  • concurrent access (to a degree)
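A minimal usage sketch with Python's stdlib sqlite3 (table name made up): there is no server to set up, and the with block wraps the insert in a transaction.

```python
import sqlite3

# a single file is the whole database; ':memory:' here for demonstration,
# a real path works the same and persists between runs
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

with con:  # 'with' commits on success, rolls back on exception
    con.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))

print(con.execute("SELECT body FROM notes").fetchone()[0])  # hello
```

That it takes three lines to get a transactional, SQL-queryable, recoverable store is much of the appeal.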

Cases where SQLite is interesting:

  • embedded things - SQLite and BDB have both been fairly common here for a good while
  • giving an application memory between runs
without reinventing various wheels
  • storage for simple dynamic websites (without needing a database, though on shared hosting this is usually just there anyway)
particularly if mostly read-only
  • interchange of nontrivial data between different programs / languages
(that doesn't need to be minimum-latency)
  • when accessed via a generic database interface, you can fairly easily switch from sqlite to a real rdbms
e.g. during development it's useful to do tests with an sqlite backend rather than a full installed database
  • Creative uses, such as local caching of data from a loaded central database
note you can also have memory-only tables

Limitations, real or perceived:

  • many-client cases: while sqlite will function with multiple users, it will perform better with fewer users
...though concurrency is better since WAL, i.e. since version 3.7
  • no foreign keys before 3.6.19 (can be worked around with triggers) and they're still turned off by default
  • triggers are somewhat limited (verify)


  • it is dynamically typed, meaning the type can vary per row. This is probably not your initial expectation, and you need to know what it does.
this is sometimes convenient
and sometimes weird (certainly unlike most RDBMS) and a reason you need to wrangle it to act consistently
  • some cases of ALTER TABLE basically require you to create a new table (this is true for various larger database systems too, they just do it more automatically)
  • no permissions (...with the reason that it's fairly pointless if you can read the file anyway)
  • the less-common RIGHT JOIN and FULL OUTER JOIN are unimplemented (verify)
but: they're not used much, and you can rewrite queries to work


String or BLOB exceeded size limit seems to mean that one or both of:

  • a query size exceeded SQLITE_MAX_SQL_LENGTH (1MB by default)
you can avoid this by using parametric values (binding) instead of literal query strings
  • a row, text field, or blob field exceeded SQLITE_MAX_LENGTH (1GB by default)

(possibly other limits)

You can set these while compiling. SQLITE_MAX_LENGTH may be at most 2^31-1, and SQLITE_MAX_SQL_LENGTH may be no larger than SQLITE_MAX_LENGTH (and possibly also no larger than 1GB?(verify)) -- not that wanting a 2GB query is very sane...
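A sketch of the binding workaround with Python's stdlib sqlite3: a bound parameter never becomes part of the SQL text, so it does not count against the query-length limit the way an inlined literal would.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blobs (data BLOB)")

big = b"x" * (2 * 1024 * 1024)  # 2MB, larger than a 1MB query-text limit

# bound parameter: the value is passed alongside the (tiny) SQL text
con.execute("INSERT INTO blobs VALUES (?)", (big,))

# inlining the value as a literal would count against SQLITE_MAX_SQL_LENGTH
print(len(con.execute("SELECT data FROM blobs").fetchone()[0]))  # 2097152
```

Binding is good practice anyway (it sidesteps escaping and injection issues); avoiding this limit is a bonus.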

Multiple processes

Multiple processes can read the same database. Only one can write; attempts to write will lock the entire database file. It's up to clients to decide whether to error out complaining about that lock, or wait and retry.(verify)

Note that locking will not work over NFS, and there are issues with SMB / windows file sharing you probably want to know about.

Note you should not carry an open connection through a fork.
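The wait-or-error choice is commonly exposed as a busy timeout. In Python's stdlib sqlite3 (file path below is made up), timeout says how long to keep retrying a locked database before raising 'database is locked':

```python
import sqlite3, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.db")

# timeout=5.0: retry for up to 5 seconds when another connection
# holds the write lock, instead of failing immediately
con = sqlite3.connect(path, timeout=5.0)
con.execute("CREATE TABLE t (v INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

# a second connection (as another process would have) sees committed data
con2 = sqlite3.connect(path, timeout=5.0)
print(con2.execute("SELECT v FROM t").fetchone())  # (1,)
```

Other language bindings expose the same knob via sqlite3_busy_timeout or an equivalent.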

Say the writers: "SQLite is not designed to replace Oracle. It is designed to replace fopen()"

Multiple threads

SQLite can be compiled to be thread-safe, and usually is (the reason you can compile it without that is that removing the mutexes is slightly faster, for cases you know will always be single-threaded). You can query that compile option from the database if you want to be sure.

This page points out there is a further distinction between multi-threaded (where threads must not share a connection) and serialized (where that appears to be fine).

Some libraries may have their own decisions - e.g. python3's sqlite3 will complain SQLite objects created in a thread can only be used in that same thread though that seems to be overzealous

On SQLite typing
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

This page mentions a lot of what you want to know.

For contrast: where most RDBMSes are statically typed (converts data to the column's type on storage, fails if it can't)...

...SQLite is more like dynamic typing in that SQLite decides the type from the value you give it, and only somewhat cares about the according column's type.

A little more precisely, the schema defines a type affinity, not a rigid type.

'type affinity' here meaning 'the value may still be of any type, but we prefer this one if that works'

Declared SQL types map to one of the affinities INTEGER, REAL, NUMERIC, TEXT, or BLOB (stored values themselves have a storage class: NULL, INTEGER, REAL, TEXT, or BLOB)

e.g. INTEGER is split into seven specific-sized things -- on disk, anyway; in memory all are loaded into int64(verify)
TEXT is stored as UTF-8, UTF-16LE, or UTF-16BE (according to the database encoding) (verify)

For example(verify)

  • if you store into a numeric column, it will do something like
    • if the value is well formed as an integer, store it as INTEGER, using the smallest integer size that will hold it
    • if it's an integer larger than int64 can store, it will try REAL
    • if it's well formed as a real number, store it as REAL (seems to be float64, so any extra precision in the text will be lost)
    • if it's not well formed as an integer or real, it's stored as TEXT
    • hex (like '0x1A') is stored as text

Assume that the schema is not used for conversions going out

AFAICT, what you get out is whatever type got stored, and it is the logic on insert that is interesting. At least, that's what dynamically typed languages tend to do with it. Statically typed languages (including SQLite's own API) may get you to convert it, and/or make you test for it(verify).
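A quick demonstration of the above (Python's stdlib sqlite3; a sketch of the behavior as described, so treat the details as worth double-checking): a column declared INTEGER will happily store text, and what comes out is whatever storage class the affinity rules produced on insert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")   # INTEGER is an affinity, not a constraint

for value in ("42", 3.0, "hello", b"\x01"):
    conn.execute("INSERT INTO t VALUES (?)", (value,))

out = [(row[0], type(row[0]).__name__) for row in conn.execute("SELECT v FROM t")]
print(out)
# '42' is well formed as an integer, so INTEGER affinity converts it to 42;
# 'hello' is not, so it stays text; the blob stays a blob.
```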

Array oriented file databases


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

NetCDF (Network Common Data Form) is a scalable, primarily array-oriented format, which seems to have been designed to use and archive large amounts of sensor data: self-describing, appendable, with one writer and multiple readers. It bears some similarities to HDF5 in intent, and is found in similar contexts.

In fact, NetCDF4 chose HDF5 as its storage layer under the covers, so libraries can easily deal with both netCDF and HDF5 (though the netCDF data representation, especially classic, is more restricted in practical use).

In different contexts it may refer to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.

This is a little confusing, in that some of these have been less constant over time. In particular, the data format for netCDF can be roughly split into

  • classic (since 1989)
  • classic with 64-bit offsets (since 2004)
  • netCDF-4/HDF5 classic (since 2008)
  • netCDF-4/HDF5 (since 2008)

The first three are largely interchangeable at API level, while the last, also called enhanced, allows more complex data representations that cannot be stored in classic.

See also:

And perhaps

Hierarchical Data Format (HDF)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Now typically meaning HDF5 (though HDF4 still sees some use).

Describes a data model, library, and file format.


  • hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it
  • Stores various array-like stuff, and is halfway clever about reading out parts of huge datasets.
  • structured, potentially complex data
primarily cases where you have many items following the same structure, and often numerical data (but can store others)
think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays
  • fast lookup on predictable items
offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.
a little forethought helps, though
  • portable data (a settled binary format, and self-described files)
  • dealing with ongoing access to possibly-growing datasets
  • parallel IO has been considered
multiple applications accessing a dataset, parallelizing IO accesses, allowing use on clustered filesystems

See also

Apache Arrow

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Arrow is primarily an abstraction on working with certain types of data.

Arrow seems made to not be too much in the way when passing around larger amounts of data,

thinking about keeping things relatively light in-memory
zero-copy where shared memory is possible,
where copies are necessary, streams in chunks,
fast to iterate and preferably O(1) for random access (though note the latter doesn't hold for various serialisations)
thinks about parallel IO

It prefers to work with column-style data and lends itself to table-like things (more technically it is hybrid columnar), though there's no reason it won't store other types of data.

Arrow is not a database.

Don't assume you can alter data efficiently.

Arrow is not a serialization/interchange/storage format.

...there are multiple, with distinct properties. And yes, this makes "arrow file" confusingly ambiguous.

  • The IPC/feather format (uncompressed, more about faster interchange)
    • in streaming style OR
    • in random access style
optionally memory mapped (and can be by multiple processes)
  • Parquet (compressed)
more about storage-efficient archiving
much less about efficient random access
(you can also load more classical formats like CSV and JSONL but lose most of the efficient lazy-loading features)

There is also pyarrow.dataset, which can

  • read or write partitioned sets of files
    • it understands certain (mostly predefined) ways you may have partitioned data into multiple files
    • it does lazy loading, which can help RAM use (even on large Parquet, CSV, and JSONL)
  • selectively load columns
  • filter by values
  • load from S3 (and MinIO) and HDFS

Unsorted (file databases)

Other/similar include:


Tokyo (and Kyoto)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
Tokyo Tyrant / Kyoto Tycoon

database server, so useful when you want some concurrent use.

Supports expiry, so can act much like memcached

Tokyo Dystopia

full-text search system

Tokyo Promenade

content management system built around cabinet.

Presentable in BBS, blog, and Wiki style


Which do I use?

Storagey stuff - kv or document stores

Ask for a key, get a blob of uninterpreted data, or map of somewhat structured data.

MongoDB notes


  • Weakly typed, document-oriented store
values can be lists
values can be embedded documents (maps)
  • searchable
on its fields, dynamic
e.g. db.collname.find({author:"mike"}).sort({date:-1}).limit(10) - which together specifies a single query operation (e.g. it always sorts before limiting [12])
supportable with indices [13]
field indexes - basic index
compound indexes - indexes a combination (e.g. first looking for a userid, then something per-userid)
multikey indexes - allows matching by one of the values for a field
2d geospatial - 'within radius', basically
text search
indexes can be:
hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)
partial index - only index documents matching a filter
sparse index - only index documents that have the field

  • sharding, replication, and combination
replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.
  • attach binary blobs
exact handling depends on your driver[14]
note: for storage of files that may be over 16MB, consider GridFS

  • Protocol/format is binary (BSON[15]) (as is the actual storage(verify))
sort of like JSON, but binary, and has some extra things (like a date type)
  • Not the fastest NoSQL variant in a bare-metal sense
but often a good functionality/scalability tradeoff for queries that are a little more complex than just key-value
  • no transactions, but there are e.g. atomic update modifiers ("update this bunch of things at once")

  • 2D geo indexing
  • GridFS: chunking large files and actually having them backed by mongo.
point being you can get a distributed filesystem

  • mongo shell interface is javascript

Schema considerations

  • One to one relationships? You probably want it in the same document. For most accesses it saves extra lookups.
E.g. a user's address
  • One to many? Often handled similarly: embed a list in the same document.
E.g. a user's address when they have more than one.
  • Always think about typical accesses and typical changes.
For example, moving an entire family may go wrong because values have to change in many places. (but then, it often might in an RDBMS too, because a foreign key would have to change in many places)
  • foreign-key-like references can be in both places, because values can be lists and queries can search in them
Usually, avoid setups where these lists will keep growing.
  • when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.
  • sometimes such denormalized information can actually be a good thing (for data-model sanity)
e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.
  • if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)

Also there's a document size limit

You can set _id yourself, but if you don't it'll get a unique, GUID-like identifier.

See also:

GUI-ish browsers:

riak notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Key-value store with a focus on concurrency and fault tolerance,

Pluggable backends, e.g. allowing use as just a memcache, or give it persistence.

Eventually consistent (with some strong-consistency experiments?(verify))


Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)

Backends include

  • bitcask
all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)
file copy = hot backup (verify)
  • leveldb
keys stored on-disk
secondary indexes, so limited relational-style querying at decent performance
data compression
no hot backup
  • innostore
  • memory
objects in ram

It aims to distribute perfectly (and supports some other features by assumptions), which implies you have to fix your cluster size ahead of time

Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache


Key-value store, distributed by using Raft consensus.

It is arguably a Kubernetes / Google thing now and has lessening general value, in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.

CouchDB notes

(not to be confused with couchbase)

Document store.

Made to be compatible with memcachedb(verify), but with persistence.

  • structured documents (schemaless)
can attach binary blobs to documents
  • RESTful HTTP/JSON API (to write, query)
so you could do with little or no middle-end (you'll need some client-side rendering)
  • shards its data
  • eventually consistent
  • ACIDity per document operation (nothing larger - so not a great fit for inherently relational data)
no foreign keys, no transactions
  • running map-reduce on your data
  • Views
best fit for mapreduce tasks
  • Replication
because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness
which is nice for merging updates made remotely/offline (e.g. useful for mobile things)
and don't use it as a message queue, or other things where you want these guarantees
  • revisions
for acidity and conflict resolution, not in a store-forever way.
An update will conflict if someone did an update based on the same version -- as it should.
  • Couchapps,

document ~= row


  • view group = process
nice way to scale
  • sharding is a bit harder


  • not in views
  • if large, consider CDNs, a simpler nosql key-val store, etc.

See also:


Javascript analogue to CouchDB.

Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.

Couchbase notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

(previously known as Membase) (not to be confused with CouchDB)

CouchDB-like document store, plus a memcached-compatible interface

Differences to CouchDB include:

  • typing
  • optional immediate consistency for individual operations
  • allows LDAP auth
  • declarative query language
  • stronger consistency design


Distributed key-value with optional transactional API

See also TiDB, which is basically an SQL layer on top that makes it NewSQL.


Mixed-model but seems built on (ordered) kv.

Networked, can be clustered (replication, partitioning).

Does transactions, serializable isolation, so actually tries towards ACIDity.

Tries to avoid some issues by not allowing transactions to run for more than a few seconds or exceed 10MB of writes, and other limitations like that.


On-disk key-value store

License: New BSD


A Facebook fork of LevelDB, focusing on some specific performance details

License: Apache2 / GPL2


Column store


key-value or document store.

stronger consistency guarantees than many other things

supports transactions on more than a single object - making it more ACID-style than almost any nosql. By default it is looser, for speed(verify)

very interesting performance

optionally applies a schema

Most parts are BSD-licensed. The Warp add-on, which provides the fault-tolerant transactional part, is licensed separately. The evaluation variant you get omits the fault-tolerance guarantee part.


Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.

Storagey stuff - wide-column style

Wide column is arguably just a specialized use of a kv store.


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Distributed wide-column store

Uses CQL for its API (historically that was Thrift)

Seems to scale better than others, though apparently at cost of general latency


People report (cross-datacenter) replication works nicely.

Values availability, with consistency being somewhat secondary (verify)

(if consistency is more important than availability, look at HBase instead)

See also:


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

An implementation imitating Google's Bigtable, part of Hadoop family (and built on top of HDFS).

See also:

  • stored on a per-column family basis


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

See also:


This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Storagey stuff - graph style

This one is mostly about the way you model your data, and the operations you can do (and do with fair efficiency). Note that you can use e.g. key-value stores in graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.


Apache Giraph







Cachey stuff

redis notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


  • key-value store
with things like counting
  • typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray (and some geo data types)
that you get a basic set of queries/operations for
  • allows transactions, which lets you do your own atomic updates when necessary
  • pub/sub (completely separate from the keyspace)
  • sharding via in-client mapping (so must be consistent if you use it as a store rather than a cache!)
  • primarily in-memory
though allows persisting by
logging requests for recovery that way (AOF), and/or
snapshotting current state, e.g. every hour (RDB)
...but both are still more of a persistent cache than a primary store


Typical uses include:

  • a memcache of sorts (session data, common resources)
  • an intermediate between services (see pub/sub)
  • rate limiting between services

On commands:

  • most/all types
SETNX - set only if did not exist
SETEX - set value and expiration
  • integer / counter
  • float value
  • bytestrings
GETRANGE (substring)
  • bitstrings
  • list
LRANGE (fetch by index range)
LTRIM (keep only the given index range, throw away the rest)
BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)
RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues
  • hash
HMSET, HMGET (multiple)
  • transactions:
MULTI, ended by either EXEC or DISCARD


memcached is a memory-only key-value store

  • evicts items
by least use when memory limit is hit (LRU style, geared to keep most-used data)
by explicit eviction time, if given
  • client can shard to multiple backends - an approach that has some footnotes
  • no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency
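The client-side sharding mentioned above can be sketched as stable hashing (pure Python, no actual memcached; the hosts are made up). One of those footnotes: naive modulo hashing remaps most keys whenever the backend count changes, which is why real clients often use consistent hashing instead:

```python
import hashlib

backends = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]  # made-up hosts

def pick_backend(key: str) -> str:
    # A stable hash (not Python's randomized hash()) so every client
    # maps the same key to the same backend.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return backends[h % len(backends)]

# All clients agree on where 'session:42' lives:
print(pick_backend("session:42") == pick_backend("session:42"))  # True
```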

It is not:

  • storage. It's specifically not backed by disk
  • a document store. Keys are limited to 250 characters and values to 1MB
if you want larger, look at other key-value stores (and at distributed filesystems)
  • redundant
You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)
  • a caching database proxy
you have to do the work of figuring out what to cache
you have to do the work of figuring out dependencies
you have to do the work of figuring out invalidations

Originally developed for livejournal (by Danga Interactive) and released under a BSD-style license.

See also

Searchy stuff


See also Lucene and things that wrap it#ElasticSearch

Note that it is easy to think of Lucene/Solr/ES as only a text search engine.

But it is implemented as a document store, with basic indices that act as selectors.

So it is entirely possible to do more structured storage (and fairly schemaless, when ES guesses the type of new fields), with some arbitrary selectors, which makes it potentially quite useful for analytics and monitoring.

And since it adds sharding, it scales pretty decently.

Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.

Time seriesy

Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.

They are aimed at being efficient at range queries, and often have functionality that helps with that (e.g. retention, downsampling, and aggregation over time windows).


Is now part of a larger stack:

InfluxDB[18] - time series database
Telegraf[19] - agent used to ease collecting metrics. Some pluggable input / aggregation/ processing things
Kapacitor[20] - streams/batch processing on the server side
Chronograf[21] - dashboard.
also some interface to Kapacitor, e.g. for alerts
Often compared to Grafana. Initially simpler than that, but more similar now

Flux[22] refers to a query language used in some places.

InfluxDB can be distributed, and uses distributed consensus to stay synced.

Open-source, though some features (like distribution) are enterprise-only.

Data model notes
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

In comparison to a more relational view...

  • database is a logical container for
retention policies
time series data
continuous queries

  • retention policy (RP) contain:
replication factor (copies kept in the cluster) (autogen's default is 1)
retention - how long to keep the data (min 1h) (autogen's default is infinite)
shard group duration - how much data is stored in shards (min 1h) (autogen's default is 7d)
measurements - each measurement is implicitly part of the retention policy you put it in
each database can have one or more RPs
you get a default called autogen (defaults mentioned above)
you'll quickly notice them in addressing (testdb.autogen.measurementname) though can ignore everything about them at first

  • measurement are like a table, containing tags, fields and points
  • series
basically refers to a (measurement,tag) combination you'd likely use in querying -- see below
  • tags are key-value pairs (column-like) you can add per datapoint that are
part of a series' uniqueness, and is indexed
basically, whatever you need for lookup
limited to 64K (and you probably don't want to get close without good reason)
  • field are key-value pairs (column-like) you can add per datapoint that are
not part of its uniqueness, not indexed
basically, any values you may be looking up
types include float (64-bit), integer (64-bit), boolean, timestamp, or string
a measurement takes the type of the first value it gets (and currently cannot be changed except with some creativity[23]), so e.g. forcing integer (add i) over float is sometimes necessary, e.g. to store large values without losing precision
keep in mind that a field will not take different types over time, even if it might be fine, so being consistent per measurement is a good idea. You'll see errors in the log like field type conflict: input field "ping" on measurement "net" is type integer, already exists as type float
strings possibly not limited to 64K? (I've seen conflicting information)
but you probably don't want to use influxdb as a blob store if you want it to stay efficient
you can check current type with show field keys
  • (data) points

always have a time(verify)
there is always an index on time
time precision is another detail

Typical use of measurements, series, tags

Say you want to start keeping track of CPU use and are collecting it for various datacenters (various tutorials use an example like this).

You might have a

  • specific database for this purpose (for admin reasons)
  • retention policy mostly because you want monitoring stuff deleted after a year without thinking about it
  • measurement called host_monitor

and want to enter a datapoint with

  • tags like hostname=node4,datacenter=berlin,country=de
  • fields like cpu0=88,cpu2=32

You'll notice this is a pile of everything CPU-related.

The tags are structured with common uses in mind, often the coarsest and finest things you anticipate querying on - you can e.g. pick out and average per country, or pick out a particular host if needed (hostnames are only unique within datacenters - i.e. combined with other tags).

Series are basically the idea that each unique combination of (measurement,all_tags) represents a series.

Data you send in from different places will often imply unique series, through having unique tags, though to some degree they are more of a querying concept, and a storage one only insofar that the indexing helps that.(verify)

On point uniqueness

A point is unique by (measurementname,tagset,timestamp), so if you write a point when a record with that tuple already exists, field values are merged/overwritten.

Depending on the timestamp precision you hand into the ingest url, this

  • may never happen, if it's based on 'now'
  • may never be an issue, if the precision is much higher than the interval you send in
  • may be something you do intentionally

On timestamp precision

Timestamps are currently nanosecond resolution by default. This can be reduced to microsecond, millisecond or second.

Lower-precision timestamps lead to

  • a little more on-disk compression[24]
more overwriting of data with the same timestamp (see previous point)


Open questions:

  • Is that per database, per series, or per datapoint at insertion time?
  • Does it mix precision if you alter precision over time?


/query - queries and management
/write - ingest, takes line protocol
/ping - health and version
/debug/pprof - generate profiles for troubleshooting
/debug/requests - track HTTP client requests to the /write and /query endpoints
/debug/vars - collect internal InfluxDB statistics

The line protocol[25] is a one-liner text presentation that looks like

measurement,tag_set field_set timestamp


tag_set and field_set are comma-separated key=val pairs
timestamp is nanosecond-precision Unix time
(also optional; defaults to local timestamp, UTC, but be aware of
clock drift (so you likely want NTP)
timezones (so have a serious think about using either client time or server time))

Clients may ease conversion of structured data to line protocol.
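A minimal sketch of producing such a line in Python (the escaping here covers only commas, spaces, and equals signs; the real line protocol rules have more cases, so treat this as illustrative):

```python
def escape(s):
    # Tag keys/values and measurement names escape commas, spaces, '='.
    return s.replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")

def line(measurement, tags, fields, timestamp_ns):
    tag_part = ",".join(f"{escape(k)}={escape(v)}" for k, v in sorted(tags.items()))
    # Integer fields get an 'i' suffix; bare numbers are read as floats.
    field_part = ",".join(
        f"{escape(k)}={v}i" if isinstance(v, int) else f"{escape(k)}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{escape(measurement)},{tag_part} {field_part} {timestamp_ns}"

print(line("host_monitor",
           {"hostname": "node4", "datacenter": "berlin"},
           {"cpu0": 88, "cpu2": 32},
           1465839830100400200))
# host_monitor,datacenter=berlin,hostname=node4 cpu0=88i,cpu2=32i 1465839830100400200
```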

On shards and shard groups

Under the hood, data is stored in shards, shards are grouped in shard groups, shard groups are part of retention policies.

This is under-the-hood stuff you don't really need to know, though it may be useful to consider in that shard groups are related to

  • the granularity with which old data is removed because of retention policy (it's dropped in units of shard groups - so never immediately)
  • efficiency of the typical query, e.g. when most queries deal with just the most recent shard group, and the rest is essentially archived and rarely or never touched by IO
Config notes

Logging is great for debugging, but is both verbose and (unless all its clients POST instead of GET) puts possibly-sensitive information in system logs.

In which case you probably want to set log-enabled = false in [http]

Note that if you put an HTTP server in front, the same logging concern may apply there as well.

If apache, look at something like

 SetEnvIf Request_URI "/submit" dontlog 
 CustomLog /var/log/apache2/access.log combined env=!dontlog

Security notes

Host security

Because it's designed to be clustered, it serves on all interfaces by default (and names should be resolvable).

On a single-node installation you could choose localhost-only serving via bind-address in the [http] section, which you'd want as

bind-address = "localhost:8086"    # default is ":8086"

(There used to be an admin web interface on port 8083, but this has been removed[[26]]. You now probably want to use Chronograf)


For similar reasons, by default there is no authentication[27]

  • you may wish to firewall things at IP level
  • if you want auth, you need to enable security and create users (see below)
  • auth happens at HTTP request scope, e.g. for the API and CLI
  • certain service endpoints are not authenticated
  • can do HTTPS itself [28]
you can get server certs and client certs - and use self-signed ones if you wish
note that in a microservice style setup, you may wish to do this on the edge / ingest sides instead

User authentication

Enable: set auth-enabled=true in the [http] section and restart

You can

  • use basic auth
  • hand in username and password in the URL or body
  • JSON Web Tokens

If nonlocal, it's recommended you use HTTPS, because all of these options are effectively plaintext.

User authorisation

New non-admin users have no rights. They can be given

ALL (meaning READ and WRITE)

per database

New admin users have a lot more granularity, like

  • user management
Querying notes

Query languages

InfluxQL - an SQL-like language [29]

Flux - a more featured language [30]

InfluxQL examples (you may want to run the cli, e.g. influx -precision rfc3339 where that argument is for human-readable time formatting):

USE testdb

A simple query would be

SELECT "eth0_rx", "eth0_tx" FROM "pc_monitor"

would be all data of that series that we have

Queries like

SELECT "cpu_used" FROM "pc_monitor" WHERE time > now() - 15m

but when you want a timeseries you often want to regularize it like:

SELECT mean("cpu_used") FROM "pc_monitor" WHERE time > now() - 15m GROUP BY time(1m) fill(null)

This adds a time interval (1m) and what to do with multiple values within an interval (aggregate into the mean)

On fill(): GROUP BY time() creates regular intervals(verify), so it has to do something for intervals with no data. Options:

null: return timestamp with null value (default)
none: omit entry for time range
previous: copy value from previous time interval
linear: linear interpolation


Getting the most recent value


SELECT last("cpu_used") FROM "pc_monitor" WHERE time > now() - 1h


last() aggregate is what it sounds like
you want a time limit, to avoid selecting the entire time series for that
you probably want that anyway, when you care to view something current

For a gauge, you may want a recent average, like:

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) ORDER BY time desc


  • because this compares against now, the most recent interval that GROUP creates may not have a value in it yet, meaning you'll get an empty value there

SELECT LAST(eth0_tx) from pc_monitor

SELECT LAST(field_name), * from test_result GROUP BY *

GROUP BY * effectively separates by series

SELECT LAST(*) from pc_monitor group by *

Practically similar to

SELECT * FROM "pc_monitor" WHERE time > now() - 15s ORDER BY time desc limit 1

Though you may like some averaging, like

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) fill(none) ORDER BY time desc limit 1

Note that without the ORDER BY time desc limit 1 you'd probably get two time periods (at least until the group time is at least twice the selection time)

Dealing with null

See also:

Management notes

Deleting data

If this is about removing too-old data, the never-think-about-it approach is to set up retention policies.

...but yes, you can do things like:



all data and series from a measurement

DROP SERIES FROM "net" WHERE hostname='myhostname'


drops all series that apply

DELETE FROM "net_monitor" WHERE hostname='myhostname' and time < now() - 1h


the delete granularity is effectively measurements (not tags) (verify)
this won't delete the series, even if it removes all points

DROP SHARD shardid


you'd probably get the shard id from show shards

Browsing data

Use the CLI, or something like Chronograf or Grafana.

There used to be an interface [ this?]

CLI example:

> use foo
Using database foo
> show series

Backup and restore

influxd backup -database name -portable

Storage size

Because of the compression done to older data, and the often-quite-compressible nature of time series, most monitoring needs don't really need to worry about space use.

This of course does scale with the amount of counters, and the time resolution of insertion.

For example, in a 70 day test with dozens of counters inserting in intervals between 2 to 300 seconds, space sawtoothed (because the compression is staged) up 200MB, from 500ish to 700ish.

Not nothing, and on embedded you probably still want rrdtool, but also nothing to worry about on, say, raspberries or small VPSes.

Chronograf notes

Separate install and binary, so needs to be pointed at a InfluxDB instance


Tied to Postgres



See Data_logging_and_graphing#Graphite_notes

Storagey stuff - NewSQL

NewSQL usually points at distributed variants of SQL-accessed databases.

So things that act like RDBMSes in the sense of

giving most SQL features (not just a SQL-looking query language),
providing transactions,
and being as ACID as possible while also providing some amount of scaling.

NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"


  • distributed SQL
PostgreSQL API
Cassandra-like API


MySQL interface
tries to be OLTP and OLAP
handles DDL better?
inspired by Google Spanner
see also TiKV - TiDB is basically the SQL layer on top of TiKV (verify)



Unsorted (NewSQL)

Aurora (Amazon-hosted only)

MySQL and PostgreSQL API to something a little more scalable than those by themselves


basically BigTable's successor


See also: