Varied databases

From Helpful
Jump to: navigation, search
Pieces of larger or complex systems
Varied databases
Message brokers/queues; job management
Locking, data versioning, concurrency, and larger-scale computing notes
Database related

More theoretical

Practical notes


Describing database systems

On database types

Block store

Basically just means access to what now is often a large disk (historically tape) at low, level.

In moderately-sized blocks at a time.

The most 'do everything yourself' variant of storage. Utilities like tar ('tape archive') add just enough to distinguish different files, but basically nothing more.

It's relevant as an intermediate layer, e.g. VMs will want their storage on a block device. This itself is often handled through another layer of logical volumes

There are basically no use cases where users or services want a block device directly.

Object store

Object stores often mean storing arbitrary-sized blobs of data, under arbitrary names.

A blob store, today, usually refers to object stores -- though also, 'Binary Large OBject' still has relation to doing this in databases where it was perhaps less of a fit.

Like a key-value store in many ways:

  • very basic basic operations: mostly store, fetch, and delete, and maybe list
  • no structure to the names


  • object stores may allow large objects (gigabytes or more), kv may limit values to a few kilobytes
roughly so that you use kv for short term, OLTP-ish stuff, and object stores for bulk storage and archiving stuff
  • object stores are more likely to have list operation, where kv may not
  • object stores usually let you separately fetch metadata
  • object stores may have access control per item (and per moderate unit), kv often has none
  • object stores may avoid transactions (there's typically no need), and be more eventual-consistency, particularly in designs that want to scale well
as may kv.

For example, S3 is an object store much more than a kv store - though .


It's worth pointing out that a filesystem is a database.

Code (usually kernel code) manages all the data structures, it has strict semantics that ensure things happen in an atomic, consistent way.

...well, sort of. It turns out it's really a relatively poor database, for a slew of reasons.

  • many things are not transactional, meaning you need a lot of error handling
A treewalk can find itself erroring out on something that it was just now told existed.
This is where NoSQL wouldn't help much either - they rarely even try or pretend to be transactional.

  • Tree restrictions that are hard to keep
say, filesystems tend to have a limitations like '255 characters in a filename' and '32667 characters in a path'.
so what happens when you're a few bytes under the latter limit, and you move a directory into a subdirectory so that directory doesn't break that limit, but files in it now do?
if that move is just "directory parent is now this other directory", this is often not checked
which means it will work until some API calls break
but if there is another, it might not.
the errors tend to be quite unhelpful

  • Text encoding is usually... breakable.
In bytes-based unices, which are conventionally UTF-8, it's easy to get invalid byte sequences.
In windows, which are mostly UTF-16, it's entirely possible to get invalid sequences.
Languages that want to interface with both linux and windows have to do even more creative things

I have one file on my windows drive that I haven't been able to delete in years. Explorer cannot do it. Probably something can, but I'll be damned if I can find even the reason it exists this way

As such, APIs that want to be portable have to wrap two different existing APIs creativity, and somehow deal with the fact that the underlying filesystem might well allow incorrect data.

  • Tree structure can be weird
for example, symlinks and links may mean that what you previously thought was a tree is actually an acyclic graph or even cyclic graph instead.
the last meaning that treewalking is technically infinite, except it will error out on length at some point

  • there are some practical cases of unnecessarily bad algorithmic complexity
consider e.g. an ls that goes through items and stat()s each, when that stat happens to be implemted in a way that makes it linear-time lookup. That makes a simple directory listing quadratic time with the amount of entries
and trust me, windows is no stranger to being slow as molasses when viewing more than 20K or so things in a single directory.

The POSIX interface (and the windows one too) is defined very serially.

Unnecessarily so, but thirty years ago it mostly worked fine, and now we can't change the semantics.

which is why distributed filesystems will point out their own, non-standard API.

  • metadata isn't centralized / indexed, so most queries are inefficient

It's great for robustness against local errors, sure. And not an issue for many specific accesses. But very poor for others.

Consider that a treewalk on a POSIX style filesystem is tens to hundreds of thousands of syscalls, that also have to happen in absolutely strict sequence because the semantics require it. Not even the much smaller latency of SSDs will help much here.

That said, 'list everything you have' is also something you never want to do with any online stores.

Example: I have one directory structure for a bunch of structured data - 500K files, 19K directories.

When the metadata is not cached in RAM, it takes eight minutes just to walk through and just list them.

That's just stats, directory-opens, getdents, because that is the only way to fetch that structure

When it is cached, it still takes three minutes

I have that same directory listing mirrored in a database. It's milliseconds in total to do most things I care to do. And a few hundred milliseconds for rare things. Even a total dump only takes seconds.

When you work with things like S3, you'll probably find yourself reflecting how this is sort of file storage in that buckets are like folders, objects are like filenames that refer to arbitrary-sized blobs of data, but without things that can go wrong at directory-structure or weird-filename level.

Computer data storage - Partitioning and filesystems


The old standby.

Also great at structure, generally the best consistency managers we have, and allows things like relational integrity that can actually be enforced.

The consistency is not due to their relational nature, in fact that makes things a little harder. It's just that that was one of their design goals, because they used to be the only real choice in serious database.

That same consistency is what fundamentally puts a limit on their ability to scale. (even loosening the consistency management only makes them somewhat faster. It is their overall model that doesn't scale)

NoSQL scales better in part because its guarantees are simpler, and less strong.

NewSQL is an interesting new balance between the two, but comes with footnotes that you really should know.

The below won't go into established relational databases. It is assumed you know how to find information about Postgres, MySQL and such.

...though things like Apache Derby (previously IBM Cloudscape) can be interesting if you want to, basically, embed a small (~3.5MB) RDBMs within the same VM as your Java application. (though it can also be set up client-server style) Or, perhaps more likely, something does this to give you standard SQL OLTP without requiring you to install and set up something more complex.

Key-value stores

You ask for a value for a given key. Typically no structure to the data other than your interpretation after fetching it they lie very close to object stores and blob stores (the last basically file stores without filesystem semantics).

When this fits your data and use, these are often low-bother ways to store a whole bunch of information, often with some corruption recovery.

If you have something in a RDBMS where you are actually mostly retrieving by primary key, and doing few or no joins (which may include things like simple data logging), you could use one of these instead.

Disk-backed or not Various file-based stores (see e.g. File database notes) are effectively disk-based key-value stores.

Since they are often indexed, and cached in some way, you may easily think of them as hashmaps that happen to be large and stored on disk.

Used more than you may think; Berkeley DB is used in many embedded setups.

There has been some revival in this area. For example, Tokyo Cabinet is basically a modern, faster reimplementation of the dbm. (with some extensions, e.g. Tokyo Tyrant to have it be networked, Tokyo Dystopia to add full-text search).

When not disk-backed, they are effectively in-memory caches (e.g. memcached), and sometimes also useful as message broker (e.g. redis).

Document store

key-value where the value is structured data, often not following a strict schema.

It is also frequently possible to index on these fields

Often presented as JSON (or XML, though XML databases can be considered a specific type of document store).

In contrast with e.g. relational data, documents are often altered individually, delivered as-is, and not heavily linked.

Column store

Regular RDBMses store data by rows, which makes sense when rows represent individual things you usually handle one by one.

A column-oriented DMBS basically store columns at a time, which makes sense when you're typically working on columns, e.g. primarily doing aggregations, summaries, statistics, or analycics on what works out as a column at a time (with less focus on data relation and structure).

It also tends not to need indices for these applications.

There aren't really NoSQL column stores I know of, but it's good context for...

Wide column store

A wide column store, sometimes extensible record store, could be thought of a key-value store where the key is pair like:

(row,col) -> data

...though the the row,col name only suggests that a common use is a way of storing a matrix.

Arguably, you still have a table-style data model (and e.g. Cassandra tries to be maximally SQLish ).

This e.g. allows

  • modeling a single matrix-like table, sparsely
  • modeling EAV-like storage and access (more scalably than putting it in relational)

These stores are often also distributed, and add further properties to the way data is organized, and accessed, to make certain uses more practical or efficient.

For example, bigtable chooses

  • read/write under a row key are atomic
  • data under a row key is ordered

...which helps related sites be on similar nodes, helping access speeds

  • columns are grouped into column families
which makes them more structured under the covers (and helps access efficiency)

Bigtable suggests access like:

(row:str,column:str,time) -> data

In particular, various have the idea of column families (making them more structured than e.g. EAV would suggest)

For another example, cassandra has partition keys, clustering columns that helps direct how data is stored.

primary key is partition keys + clustering columns?

The query capabilities vary

From another view, each row still maps to many columns that could be selected. Part of the access efficiency comes from the fact we need only select the columns we are interested in, and this is part of the lookup, not the filtering.

Wide column stores are, by themselves, just extensible record stores, but various implementations add additional properties

Examples: Bigtable (proprietary) HBase Cassandra Cosmos DB (when using its Cassandra API)

Bigtable is a distributed data chunk storage.

While bigtable is built on things only available within google, implementations based on the whitepaper are, like Cassandra and HBase (on top of HDFS)

It aims to meet a moderately specific sets of requirements, which happened to be useful for google search, gmail, and such.

It aims to serving potentially large chunks of only loosely structured data, with low latency where possible/necessary.

It actually is a moderately specific a data/access model, with ordering that allows for a B+-tree-like distribution model across various nodes

The data model is described in the whitepaper as "a sparse multi-dimensional sorted map"

In other words, a map.
With some fancy implementation details.

In terms of uniqueness, the map is from a 3-tuple:

row key (string),
column key (string), and
timestamp (int64),

to a string value

row key (string)

(row,column,timestamp) -> value )

(Note that to Bigtable itself, these values are uninterpreted bytes. It is up to clients to give it a meaning, as necessary, e.g. serializing )

This is not the way you access Bigtable, and most real-world applications of Bigtable are neither quite like a map or like a row-column table. Both of those can be useful assisting mental models, but you are likely to use it quite differently. Note that columns are weak versions of RDBMS-style columns at best.

(Note that you can use the bigtable setup in more than one way, because there is no predefinement or enforcement type of schema, only one implied in use.)

Consider for example the white paper's example is to store, for each URL, the page contents and sites that linked to this one (data you might want want for some particular linkage experiment or so, and would generate from crawl data).

For example, if and have links to, you would have stored items probably at least including the following:

('', '', 0 ) -> 'link name used from foo'
('', '', 0 ) -> 'link name used from bar'
('', 'contents:', 0 ) -> 'whole_html_page'
('', 'contents:', 0 ) -> 'whole_html_page'
('', 'contents:', 0 ) -> 'whole_html_page'

In this case, anchor is the column family you are using to map from link URL to link name.

In comparison to RDMBS, you would have a few options, and only adding an anchors table to the schema is even vaguely sensible (a compounded list in a single column would not scale, and a column per site would be crazy because of the way that is stored)

In comparison to maps, well, you can indeed use a basic a map like the above, creating a key single key from the elements above.

Except, of course, Bigtable rows and columns have meaning, in terms of querying, management, access control, (and also in implementation details, such as that compression happens per column family), and such.

Mutations happen atomically at row level. You can open a row as if it were a record, and add/alter/delete columns from it, then apply (store).

Reading can be done in a few ways, but mostly done by column family/families in a row or in any row(verify), optionally with row/column/timestamp filters(verify).

See also:

Often used with

that imitates a row-column setup at scale (and often with column sparsity)
(with the implication that each cell can be formatted differently, but you usually don't want that)

Depending on how you use this, you could store something imitating tables, rows, and columns, but implements it as a variant on a key-value store.

This is

unlike classical databases in that it doesn't store rows as units,
unlike column stores in that it also doesn't store columns clustered
though wide-column can be hinted to do that if possible


sometimes lead to very compressible storage
you may have higher per-cell overhead, but also be fairly predictable
for extremely large stores, this scalability will often make up for (for reasonable queries)


the column family stuff (hinting and grouping) can make a lot of difference
which is great if you know how to use it
and less great if you don't


  • at smaller scale, it tends to be less efficient.
join and queries involving multipe attributes are often less efficient
writes may be less efficent
Consider e.g. that updates for multiple things in a 'row' has to go to ten different places, and in a classical database just to one.

So arguably, this is more of an OLAP than OLTP thing.(verify)

Examples: Cassandra, Bigtable, Spanner, HBase



A graph in the mathematical sense, of a node-link structure.

Certain types of data are very handy to see this way - sometimes because it is a good fit, sometimes because it is structured but still free-form (so you don't end up with 100 relation tables and a headache of joins, and can be much faster when you typically look for very small parts of that data)

Search engine

While generally considered an index based on a primary data store elsewhere, that also makes that searchable index the thing you actually use.

Plus there are projects that do both.

Message brokers

Message queues are useful to send messages when more than two endpoints are involved.

Not a storage model, or NoSQL, they just turn up in the same places because they are really useful tools when you're scaling things up, be it around clusters/swarms, whether microservice style or not,

Also, some of the newer implementations made it a it a lot easier to interchange not only between threads and processes, but also between hosts and even datancenters with less work.

Details beyond that vary a lot, like

delivery model
ordered (e.g. fifo) or not
single consumer or broadcast
one-way or two-way
awareness of balancing
whether things are removed only once acknowledged (useful for job queues)
whether there are priorities in the queue
whether it's guaranteed low-latency or not
whether delivery is guaranteed (and whether you get a list of things that failed to deliver)
whether messages are federated
whether messages are kept temporarily or forever,
whether it's working from a central broker or not,
whether it's guaranteed low-latency or not
whether security is a concern

Data consistency and versioning, and its concepts


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

ACID (Atomic, Consistent, Isolated, Durable) is a set of properties you want to guarantee that the data it shows you is correct and consistent and up to date at all times.

'ACID compliant' is intuited as 'a guarantee that my data is safe and correct'.

For a large part, that is about how a sequence of interactions is handled, in particular

dealing with sequences of statements that must be applied all together, or not at all: transactions
parallel execution
because most real practice has more than one user
because it make most parts a lot harder (arguably, there should be a P in the acronym, because parallelism is usually the part that introduces the most complexity, edge cases, and tradeoffs, and is sort of important to understood well)

The parts:

  • Durability
roughly: when the backend acknowledges a request, this change will survive, and you'll get it back as-is, even around various typical malfunctions
in part a 'guaranteed to be actually written to disk', in part about recovery, in part about ensuring other ACID parts during recovery
and implies implementation details like "write a new copy, then retire the old one" style journaling (see also e.g. WAL) is a popular solution - instead of e.g. "update/overwrite the existing data as you go" because while that can often be a little faster, failure halfway through might make that unrecoverable)
  • Consistency
roughly: data respects all imposed integrity rules (constraints, cascades, triggers, etc.) at all times
attempting a operation that isn't allowed will be rejected, and will never lead to showing invalid state to clients at any time
if the transaction turns out to violate a constraint by one of its later actions, it must be rolled back
this is basically impossible to do without atomicity and/or isolation
  • Atomicity
roughly: a set of changes either happens fully, or not at all
since many operations in a transaction aren't even individually atomic, this certainly requires extra bookkeeping (e.g. of before and after states)
the easiest method is locking, but this also puts a very hard limit on concurrency. Concurrent methods are more involved.
  • Isolation
transactions should not interfere with each other when concurrent
ongoing interactions are invisible to each other; even when actually handled in parallel, they should look to any client as if they were done sequentially; what one transaction alters is hidden from all others
largely means they cannot react to new state from another transaction
required for consistency
this conflicts with concurrency (more fundamentally than some think - varying somewhat with how complex the data structures are)
(at the very least, things become more complex when the data they deal with overlaps in any way)

Even with those words, and many more from the articles they came from, ACID still remains a fairly open-ended description, not a set of rules of how to implement that.

And if some of these seem to overlap, that's because they do.

Which leads to arguments like "is what isolation says already covered by atomicity?", so it's probably there because we planned to relax it from day two.

In fact, it seems to have been added later indeed;

ACD from J Gray (1981) "The Transaction Concept: Virtues and Limitations"'
ACID comes from T Haerder and A Reuter (1983), "Principles of transaction-oriented database recovery"

One way you can see the A/I difference is that

atomicity is about all of a transaction's writes becoming visible atomically as table state
isolation about different reads in another transaction not seeing different versions at different times within it
one focuses more on correctness, the other more on ordering
consider you could write a system that switches in one transactions's new state perfectly atomically, and have another transaction read from the earlier state at one time, and the new state later (Nonrepeatable reads, see below)

These details is also one reason SQL defines isolation levels, which actually relax from strict ACIDity.

So yes, real-world relational databases are typically not configured to be fully ACID (and a few can't even be).

This is also roughly why a lot of NoSQL often does not care about strict transactional correctness.

ACIDity of RDBMSes versus NoSQL

ACID is historically largely related with RDBMSes.

Not because ACID means RDBMS - ACID is just a set of principles, without even saying how you'ld implement them.

But because RDBMs's features means that without considering ACID in your design, you'll make a mess.

NoSQL is different. The same relaxation of the of the data model that allows it to scale also lowers the amount of guarantees - that even have to exist.

With no enforced relations, and generally no transactions,

Atomicity is almost implied if you are restricted to only 'get' and 'put' single items.
and it is generally not guaranteed on bulk operations. They'll say something like 'best effort'.
Isolation is irrelevant (no transactions)
Consistency has almost no rules to adhere to (beyond rejecting nonsense requests)
(Durability is a separate chat)

So ACID is sort of... irrelevant on most NoSQL.

To even name it suggests you consider it to mean "operations probably not incorrect" rather than transaction correctness (which NoSQL generally doesn't even try)


ACIDity of single-row (or key) alterations and
ACIDity of multi-row (or key) alterations.

In RDBMSes, multi-row ACIDity is required to be taken remotely seriously.

Most NoSQL isn't multi-row ACID because there are no multi-row alterations in the first place. You often don't get to group alterations in transactions that get applied atomically.

So they're ACID because they're single-row ACID.

The issue comes in when they call themselves ACID because they're single-row ACID, but they also happen to allow multi-row updates.

If you do that from your code, the result is not ACID overall. You don't get that guarantee.

And we would really like multi-row updates. Particularly around denomralization, it would be great if we can say "Update this related set of denormalized calculated fields, either all or none, because that means the app won't show something inconsistent". By and large, we cannot. We are running either on 'hope for the best', or something inbetween because some BASE implementations are pretty neat.

But technically we weren't lying.

Also, Good luck knowing for sure.

The D is sort of a separate case.

Durability doesn't mention time.

In RDBMSes, it often means that 'if I acknowedge, it means it's guaranteed to be actually written to disk'. Be it tablespace or recovery log, you'll see this data back unless something goes unusually wrong.

NoSQL is split between

Write-Ahead Logging like that, and
"sure, we save our in-memory state to disk every few minutes"
(and a few "neh, we're in-memory only", which is fair)

And in a lot of use, it's a fine tradeoff. Just know that you're making it, and that this might be a bit of a stretch of D.

Aside from NewSQL, there are few NoSQL cases that try to give you multi-row ACIDity (or detail the cases).

VoltDB [1]
FoundationDB [2]

You can ask "Is S3 ACID?" and get completely different answers, including

  • basically yes, the individual operations are very predictable
It only does get, put, list, and delete.
get and put of completely independent items barely gets into any consistency.
list may not reflect the objects present a second later, but you would only expect that if there were transactions, which there are not.
  • No. It's not a database, it's an object store. Your question doesn't make sense.
  • No. It does not give read-after-write consistency[3]
(this changed in 2021, basically after 15 years, and still seems slightly qualified)
  • Maybe; you could add something like transactions on top,
but it'll probably work out as batches and fallback cleanup. Close enough, right?
You still don't get multiple such interactions working in a serialized way. No I or A, really.
But hey, you generally wouldn't expect to.

Also, if you use S3 as a backend to a metadata store elsewhere - which is common and convenient - that's even less likely to stay consistent with each other over time.

Isolation levels

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

ACID is great theory, but it turns out that most parallelism will easily be faster only under some conditions, and in various real-world ones things can still come down to "do things largely in series, or it becomes way too complex to handle".

So another question pops up: Do you want a tradeoff between which guarantees we make, for extra speed?

Isolation levels are a controlled way to loosen strictly ACID behaviour in RDBMSes.

Roughly speaking, isolation levels control what sorts of non-ACID behaviour is allowed whenever you don't do explicit, fairly complete locking.(verify)

Three things you want to avoid (at least, the most mentioned and relevant):

  • Dirty reads - refers to fetches that ignore any transaction or locking
Means we don't have to wait at all - great for speed!
May e.g. mean one connection may read data that is not yet committed by another
when multiple clients are writing relational data that has to make coherent sense, you can assume this will go wrong at some point
  • Nonrepeatable reads: getting different values for the same row when trying to read it multiple times in the same transaction
...e.g. because of another transaction's committed alterations. Reads can be non-dirty, but still non-repeatable.
We usually care about nonrepeatable reads refer to those within the same transaction. There is also the concept of "nonrepeatable within the same query". The fact that we use the same word for both is confusing, yes.
  • Phantom behavior / phantom reads refers to transactions 'magically' seeing new rows
i.e, seeing that the result set changes (rather than that specifically identificable row values change)
e.g. caused by WHERE clauses that select from another table at different times, which can find new matching rows after they are committed by something else.

Approximately speaking,

dirty reads are reading non-committed data from another transaction
nonrepeatable reads are often because another transaction COMMITted an UPDATE
phantoms are often because another transaction COMMITted an INSERT or DELETE)

'Approximately', because with more scrutiny there are a few more specific cases .

You don't have to think about those terms directly, because SQL-92 defines isolation levels:

    • Basically 'anything you find will do', does not use/observe read locks
    • allows dirty read
    • allows nonrepeatable reads
    • allows phantom behaviour
    • Won't read uncommitted data, doesn't ensure that read rows won't change within the length of this transaction
    • avoids dirty reads (only)
    • allows nonrepeatable reads
    • allows phantom behaviour
"Won't read uncomitted data" and "allows nonrepeatable reads" seem to conflict. The point lies in the difference between a query and a transaction. Nonrepeatable reads in a query means it will not see different values if it needs to read a row twice. But two identical queries in the same transaction can still see data if another transaction has completely committed data to the relevant table. If you want all state to be repeatable within your transaction, you want either locks, or the following:
    • Won't read uncommitted data, and does ensure read rows will not change within the transaction. Put another way, you're essentially reading from the same snapshot each time.
    • avoids dirty reads
    • avoids nonrepeatable reads (the main point, since it's the name of the thing)
    • still allows phantom behaviour (...wider locking may avoid it, without being quite the same as SERIALIZABLE. Depends.)
    • usually implemented by placing locks on all data relevant to the query
    • strictest -- no transaction interaction possible at all
    • Tends to be noticeably slower, because it is often implemented with zealous locking, or sometimes even by never handling two transactions on the same object(s) at once(verify).
    • avoids dirty reads
    • avoids nonrepeatable reads
    • avoids phantom behaviour


  • Some database engines may offer their own additional levels
e.g. SNAPSHOT[4]
  • oracle and postgres say serializable when they mean snapshot(verify)
  • most of these isolations aren't ACID - that's sort of the point.
  • Most database engines can be asked to behave to different levels, e.g.
MyISAM doesn't support transactions at all, so is probably best described as READ UNCOMITTED always
InnoDB supports all four(verify), and defaults to REPEATABLE READ
its MVCC nature means it internally has the equivalents of READ COMMITTED and SERIALIZABLE.
Default is READ COMITTED(verify)
The other two don't exist, and asking for them actually gives you the next-strictest behaviour:
Note that given its MVCC design, reads tend to spend less time waiting for locks, so looser settings often aren't as useful
  • DB2 has its own terms and goes into a few more distinctions and may be interesting reading.

See also:

DB2's terms may also be interesting; they go into a few more distinctions

  • Repeatable Read (RR)
  • Read Stability (RS)
  • Cursor Stability (CS)
  • Uncommitted Read (UR)
  • No commit (NC)

See also:

Considerations to allowing dirty reads
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Using SET TRANSACTION ISOLATION LEVEL to READ UNCOMMITTED (or e.g. MSSQL's per-query WITH(NOLOCK)), can be used to allow dirty reads to happen, which means the value is read out without locking considerations.

This can be useful when any value currently in there will do. This may be true in the case of social-site profile information, continually updated statistics tables, and such. Any value from now to a few seconds ago will be perfectly good - using locks and listening to transactional accuracy will only really mean more waiting.

Of course, dirty reads may at any time return uncomitted, old, or duplicate data (because you are asking the database to violate ACIDity for the read), meaning you should only do this when you know you'll do no harm, such as when you only read the value to show it and no write/update will even be based on it. (Much the same goes for non-repeatable reads, really)

Note that MVCC-style engines, such as postgresql and MySQL's InnoDB, avoid a lot of classical style locking by isolating data versions, which means there is little speed increase in allowing requests to give you dirty versions.

See also


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

There are various takes on what happens when systems become scaled up and distributed, in terms of guarantees, and also things like bandwidth and job rate and the things that hold them back, and how such details interrelate.

One of the better known ones, dealing just with the guarantees, is Brewer's CAP theorem[5], which basically says that of three properties, no matter your answer, you are always choosing a tradeoff.

CAP refers to:

  • Consistency usually means "do all nodes see the same data at the same time",
Consistency in CAP means is mostly more about ordering and timing (this contrasts with ACID, where this concept is more part of atomicity, and consistency in ACID means "follows all the rules and restraints imposed, there often typing, relations, bad values, duplicates, etc.")
partly "once new state is successfully accepted, we never send older state back" logic, partly agreeing which version is current
  • Availability - you will get a response with reasonable time
a response that is data and not an error
but there is no guarantee that it is the most recent data
(note: an answer "regardless of things like node failure" implies that node failure does not halt the system)
  • Partition tolerance - system continues even with imperfect communication
partitioning refers to "parts of a system that don't agree on state"
dealing with issues like
"one host fell over due to hardware and then came back"
"some message got lost"
and often means the partitioned system needs to be resolved to become one whole again somehow.
...automatically if at all possible

These three are entangled to start with.

More so when you add "...within reasonable time", which CAP ignores, and you generally cannot.

The larger the scale and/or the lower the required latency, the more you lean towards accepting you trade away one of these things, in imperfect conditions or at all.

Preferably in a well-informed way.

Some potentially misleading things

  • The three concepts aren't quite the same - C and A are about properties that stick around under perfect conditions,

P is specifically about an imperfect condition - partitioning.

  • Yet in designs, trying to strongly guarantee two of these comes at the cost of the ability to guarantee the third so well.
It doesn't become exactly two either. In terms of misbehaviour that you may want, all are a matter of degree.

  • The partitioning may be an exceptional thing, or may be a regular thing
a system may try and manage to be CAP almost always. If it never partitions, there is no good reason to give up CA in regular operation.

Since most distributed systems will try to resolve partitions, and can often do so within reasonable time, this can be made to be fairly rare, at least for everyone not deploying at continent scale.


...historically had no partition tolerance at all. The relational model doesn't distribute, so we tend to make monolithic things that can't partition to start with.

That, and they've been the go to for consistency management.

The ability(/assumption) of communicating perfectly makes it easy to do consistency and availability, as long as the hardware behaves.

So these are classical examples of CA.

You can build something on top of classical databases that does have partition tolerance, but you have to most of the work yourself.

If you assume a system can partition.

Now yes, we do want to know about behaviour when it partitions.

  • C+A is a focus for consistency management, lock management, fastish transactioning
usually doesn't like partitioning - many barely handle re-syncing after a partition.(verify)
e.g. classical DBMSes

  • C+P that under regular circumstances also provides Availability, but when partitioned does not provide Availability
for large things that act as a primary data store(verify)
but often basically says "pause until a partition stops happening, e.g. going read-only or unavailable"
so that it stays deterministic with a simpler model, and more easily rejoinable
there has been research to make that more flexible
e.g. mongo, redis, HBase, bigtable, BDB (verify)
  • A+P that under regular circumstances can provides C
for speed at scale
during a partition, this will often let parts operate, with less care that nodes's data might go out of sync
with the hope it can resolve most inconsistencies
e.g. CouchDB, Cassandra, riak, dynamo (verify)

This isn't a complete view either, because there are lots of gradual inbetweens

(particularly at moderate scale, where resolution may be less troublesome than it is at hyperscale)

  • say, some technically fall under one but work better in reality
A+P may feel scary to do for primary storage
but consider that, for example, you have heavily nodes, and one falls out. If you have a s
Say, if all parts of a system can take a good guess at what went wrong, they might be able to
retry on a different node so that clients don't have to
inform each other and the client for some automatic reconnects,
inform the failed node that it needs to update to current state before joining
reduce the amount of inconsistency to make automatic resolution faster,

leave more of the system in read-write.

Doesn't change the category it's in, but may be a lot better behaved in practice.

In some cases, design choices staple it down pretty specifically.

Various NoSQL is A+P with eventual consistency, often meaning "Yes I updated it locally (and will tell the others as soon as possible)"

In others you can choose one of a a few different points on that continuum. For example, riak's consistency resolution is pluggable, meaning it has a "give me any answer right now" alongside a "do it better within longer".

The example ATMs from the link below is interesting

you'ld expect it to care about strong consistency
but that would mean it would pause while partitioned (because you can't know the balance is good)
in practice, that costs banks revenue, so they basically allow A+P behaviour
that means you could overdraw in ways you normally can't,
but this is resolvable in other ways,
you can limit it by setting the maximum withdrawal lower while partitioned
so probably fine unless people try to exploit it specifically, and even then the fact that accounts are rarely anonymous so banks have recourse

Sometimes there are tweakable tradeoffs (e.g. between C, A, and P) with settings, or initial-setup choices.

More commonly, it's tied to overall design, and what kind of purpose the system is intended for.


One real take-home may be given that things will go wrong some time, know your actual system's failure modes and how to resolve them

Most people don't seem to, and "the site is down, FIX IT" is not the best time to start learning the theory.

ACID databases focus more on C(onsistency), most people wanting NoSQL focus on the A(vailability), (as well as easier guatantees on time taken when C is about 'is it stored yes/no, don't care about ordering' rather than '...and what about correct ordering, and all its references at all times').

Most systems try to deal with P at least a little, because just because "you disconnected a cable and now everything is permanently broken" is not something you get away with.

That said, there is a difference between dealing with the occasional latency hiccup.

Consensus protocols also rather improve things,

a very useful thing to have for various things at scale, things like leader elections, atomic broadcasts, multicast, state machine replication (the last being useful to build systems without a single central point of failure)

Consensus protocols/implementations include


See also


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Base is a wink at ACID, if much less theoretical .

BAsically available - roughly meaning "won't block much". There will be a response to all requests, but that response is allowed to be

recent-but-old state (or, when data interrelates, logically inconsistent state).
'try again later'

Soft state - you should assume the system might be catching up, even while there is no input.

Eventual consistency: The system will tend towards well-propagated consistency once it stops receiving input.

Eventual consistency

RDBMSes do what you could now call immediate consistency.

Which basically means that "if I acknowledge an alteration, it's there in my store

Eventual consistency usually means

  • "I acknowledge I've received that data and will process it as soon as I can"

This is often before it's written anywhere, so it may hit a conflict, it may be lost in an error.

It usually won't, which is why this is a reasonable cost/benefit thing. The benefit is that we can say yes earlier. The cost is less guarantees.

Eventual consistency also implies data is available in the short run, and consistent only in the long run.

See also:

Broad qualifiers


You'll see these two as qualifiers, as kinds of setups, and sometimes as specializations of specific kinds of systems/databases.


  • OLTP basically means systems with immediacy
  • OLAP often means bulk processing for secondary reasons
  • Systems can be made to do both to some degree, but OLTP often comes with people wanting it to be as immediate as possible

Online transaction processing (OLTP) refers to any system that does day-to-day transactions

as realtime as possible
as safely/serially as configured

This usually puts focus on

  • operations being simple and fast
because if things take longer than milliseconds, it's going to scale very poorly, particularly if done serially
  • strong consistency
    • within systems - where this is simple
    • between systems - to the degree this is possible

OLTP that uses more complex data models tends to be implemented in RDBMSes, because the consistency you often want is done for you.

The focus on consistency may mean people defailt to RDBMS anyway, unless they expect scalability issues from that choice.

Online analytical processing (OLAP) tends to refer to "analyse a bulk of data" setups.

This often means that, mostly unlike OLTP

collecting chunks of data collected over a larger amount of time
often done for business reasons, rather than direct use by users.
classically was more about generating reports, analytics, etc. to be looked at later - There are few cases where a response coming seconds later is any problem (or days, if we're talking about reports)
now more frequently also to be looked at live (scale of seconds), and is starting to overlap with more complex variants of health monitoring
(compare each to OLTP, where responses are measured in milliseconds)

On mixing the two

When doing OLAP, you could just send queries to your OLTP system, but because this often implies bulk queries, this can slow down that OLTP system, which could be critical.

As such, there is a habit to siphon off data from an OLTP database to a separate OLAP database and separate processing host, to work on independently.

There are now also Hybrid transactional/analytical processing (HTAP) is the attempt to do both a transactional database and analytics.

The more at-your-leisure it is, the more that classic ETL style processing covers you needs, but these days, there are OLAP systems that aim to do real-ish time monitoring, and more explicitly balance scale and speed (also with NoSQL-ish scaling).

It's often still factors slower than OLTP, or at least behind a few seconds (e.g. when running off a replicated database server)
yet still many factors faster than generic warehousing, "throw an ETL script at it", "produce a report" style things),

OLAP might be based on structured data from a central OLTP system you have, or it might ingest less-structured data (like logs).

In the latter case, terms like 'warehouse' or 'lake' turn up.

As a broad qualifier, OLTP and OLAP can describe

  • a system as a whole - goals, database, ode, the hosts used
  • a type of database - e.g.
RDBMS is a fairly classical case of OLTP
most NoSQL (e.g. kv and document stores) are still largely on the OLTP side (if only because few do much processing themselves)
e.g. ElasticSearch used for log analysis or monitoring (see e.g. logstash, kibana) is more on the OLAP side
  • specific queries
e.g. some complex queries may be OLAP
queries may get so complex that people do some pre-structuring

It also has some applications to architecture. OLTP often implies tiered applications, whereas OLAP modelings is much more varied, often somewhat custom (though OLAP NoSQL does exist), from monolithic ETL batch script to complex interdependent streaming setups.

Lakes, warehouses, swamps, pits, and graveyards

Data warehouse means data in a processed, probably actively useful form.

This term has been around since the seventies.

Data lake basically means 'not a data warehouse', and includes e.g.

  • "new data, in raw form, that we haven't parsed yet"
  • "we're tracking things, we'll see how it's useful later" the point that there is so plan that

you don't know what you have
you will have a really tough time figuring how to use or combine or even find this data later
you created what you could creatively call
data swamps (inaccessible)
data pits (won't produce clear insights)
data graveyards (you're not using it, and you never will)


  • searching for 'Data lake' will mostly give you IoT related results -- mostly because fancy promises aside, IoT primarily just means 'collect all the data (personal or not) and see what we can do with it later'.

  • Data lake sometimes also includes the software that accesses the data - anything from ETL/ELT, and sometimes much fancier analytics.

  • In practice, though,
    • data warehouse and data lake are often used interchangeably.
    • (so) when data warehousing can mean both, then ETL and ELT are both seen in data warehousing


Data warehousing historically usually implies ETL (Extract, Transform, Load)

ETL means clean up your data (data types, data validity) before when entering it into the system

...often so that it can be stored in a structured database, meaning
you have specific goals in mind
so have already thought about efficient querying, and ensured faster loading
and you need to define that database up front

Data lakes more often take the ELT (Extract, Load, Transform) approach:

just store what we got, in the form we got it in, and worry about it later

This is in part merely pragmatism:

  • we can capture any and all data
  • and there's no point in spending time on data we may never actually use
  • you need not restrict yourself to your original goal. You might be able to ask more varied questions, and get more answers, if you keep the original data around.

However, the same thing that leads to these upsides (you didn't do anything yet) lead to downsides:

  • you will parse and validate it on use
...either on every use
this tends to be true for ad-hoc querying, which tends to mean mapreduce / pipeline solutions
(who cares if that takes more power and is slower)
...or on first use, e.g. if you validate-and-store into a working set for every question you work on
(sort of meaning you create a warehouse for each use)

Some such databases, by type

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

File databases

Not a model but a practicality:

File databases are useful to persist and store moderately structured data on disk,

usually within a single file.
and usually from a single, simple library

The goal is often a mix of tendencies like

  • persist a larger set of possibly-somewhat-complex data without having to think about how to do that at lower layers
  • persist large amounts of data without having to think about efficient lookup so much
  • a relatively simple minimal library, and minimal configuration.
  • (because it need only) support only a single program/client
  • relatively relatively read-heavy

key-value file databases

The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (both strings) in a file (sometimes two, splitting out some metadata/index(verify)).

These may be understood as an on-disk variant of a hashmap, allows rebucketing to have the hashing scale well, and fixed-size buckets to allow relatively efficient modification.

They are also libraries, so easy to embed and avoid network serving details.

dbm family

Within the dbm family, the ones that are interesting enough to use are probably roughly:

  • dbm, 'database manager' (1979, AT&T)
  • ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
    • sdbm, a 1987 clone of ndbm, made for licensing reasons
  • gdbm, 'GNU database manager', also added arbitrary-length data
  • Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.

There are also continuations / successors of this idea, including

  • Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea
  • tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but

safely allows concurrent writers

  • tdbm: variant of ndbm with atomic transactions, in-memory databases
  • MDBM: memory-mapped key-value database store derived (like sdbm and ndbm)
  • QDBM

Berkeley DB notes

Berkeley DB (also known as BDB, and libdb) is basically a key-value map in a file.

It is a library instead of a server, so can be embedded, and is used like that quite a bit.

For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.

Technical notes

The low-level interface does not support concurrent write access from multiple processes.

It has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them.

From different processes, you would use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options, which also control whether the env is safe e.g. for multithreading but not between processes, safe between for several processes, etc. See things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.

Interesting aspects/features:

  • it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
  • caching in shared memory
  • option for mmapped read-only access (without the cache)
  • option to keep database in memory rather than on disk
  • concurrent access:
    • writeahead logging or MVCC
    • locking (fairly fine-grained)
    • transactions (ACID), and recovery
  • hot backups
  • Distribution:
    • replication
    • commits to multiple stores (XA interface), (since 2.5)

Both key and value are byte arrays; the application has to decide how it wishes to format and use data.

Both key and value can be 232 bytes (4GB, though for keys that's usually not a great idea).
A database file up to 248 bytes (256TB, which is more than various current filesystem limits).

It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.

Format/access types

There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference mostly in performance, and only when the data is large, since if all data fits in the cache, this is an near-non-issue.

For larger data sets you should consider how each type fits the way you access your data.

If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).

You can supply your own comparison and hash functions.

More details:

  • Hash (DB_HASH)
    • uses extended linear hashing; scales well and keeps minimal metadata
    • supports insert and delete by key equality
    • allows iteration, but in arbirtary order
  • B+tree (DB_BTREE)
    • ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
    • allows lookup by range
    • also keeps record numbers and allows access by them, but note that these change with changes in the tree, so are mostly useful for use by recno:
  • recno (DB_RECNO)
    • ordered records
    • fast sequential access
    • also with key-based random access - it is actually built on B+tree but generates keys internally
  • queue (DB_QUEUE)
    • fixed-size records
    • fast sequential access

You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.

There are provisions to join databases on keys.

Versions and licenses
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(Note: this is probably incorrect in parts -- it's hard to find good details)

Early versions came from a library for simple hash, and later b-tree records.

  • BDB 1.85 was a version for BSD 4.4, released in 1992 under a BSD-like license (that makes it behave like the GPL)

A Netscape request for further features led to the creation of Sleepycat Software in 1996

  • BDB 2.1 was released in 1997, adding concurrent access, transactions

Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had little licensing consequences on versions since then (verify)

Versions 2 and later, including the Oracle additions, are dual-licensed, under either:

  • The Sleepycat Commercial License: A purchased license that does not force redistribution
  • The Sleepycat Public License: software that uses Berkeley DB must be distributed (note: excludes in-house development)
    • apparently, the case of language support/wrapping is a special one (not unlike LGPL) as in cases like python and perl interfaces, the software that uses BDB is the language, not any scripts that use that interface/BDB. This does not seem to apply in the case of C library use(verify).

In other words, you generally have three options:

  • use freely while the application is and stays entirely personal / internal to your business
  • you have to distribute the source of the application that uses BDB
  • you have to get a paid license

There are now three products:

  • Berkeley DB (the basic C library)
  • Berkeley DB Java, a Java version with options for object persistence
  • Berkeley DB XML, an XML database using XQuery/XPath (see e.g. [6])

Some additional licenses apply to the latter.

See also

Tokyo Cabinet / Kyoto Cabinet

database library, so meant for single-process use.

Tokyo Cabinet (2007) (written in C) is a embedded key-value database, a successor to QDBM

  • on-disk B+ trees, hash tables, or fixed-length array
  • multithreaded
  • some transaction
  • no real concurrent use
process-safe via exclusion control (via file locking), but only one writer can be connected at a time
threadsafe (meaning what exactly?)

Kyoto Cabinet (2009) is intended to be a successor.

written in C++, the code is simpler than Tokyo, intends to work better around threads. (Single-thread seems a little slower than Tokyo)


  • Tokyo may be a little faster
  • Tokyo may be a little more stable (at leas in earlier days of Kyoto's development)
  • Kyoto may be simpler to use
  • Kyoto may be simpler to install


Lightning Memory-Mapped Database

  • Ordered-map store
  • ACID via MVCC
  • concurrentcy


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
  • cdb (Constant DataBase), basically a fast on-disk associative array [7] [8]

Relational file databases

SQLite notes

SQLite is a lightweight library-based database engine - no server, no configuration, and a database it works on is a single file.

Also meaning you can use it without requiring an admin to install and configure a database engine.

It's a little fancier than similar library-only databases. Its feature set includes:

  • support for most of SQL92
  • ACIDity (using file-based locking)
  • recovery via journaling
  • constraints
  • concurrent access
  • views (to a degree)
  • querying multiple databases (as in different files)

How well it scales depends on some choices.

general RDBMS scaling notes apply, e.g. well-chosen indices, using transactions and not the default autocommit
some SQLite-specific notes, e.g. that
writes that would go to multiple databases are more efficient in a transaction, some cache tweaks
it doesn't deal as well with multiple clients

Say the writers: "SQLite is not designed to replace Oracle. It is designed to replace fopen()"

Cases where SQLite is interesting:

  • embedded things - SQLite and BDB have both been fairly common here for a while
  • application memory between runs, other file-based persistence
The ACID thing is nice
the indexes are nice
  • structured file storage
so that you don't have to think about binary coding of complex data
  • interchange of nontrivial data between different programs / languages
(that doesn't need to be minimum-latency)
  • when accessed via a generic database interfaces, you can fairly easily switch between sqlite to real rdbms
can be useful during development
  • Creative uses, such as local caching of data from a loaded central database
you can also have memory-only tables
  • simple dynamic websites
particularly if mostly read-only

Limitations, real or perceived:

  • many-client cases: while sqlite will function with multiple users, it will perform better with fewer users
...though concurrency is much better since WAL, i.e. since version 3.7
  • Its typing is non-strict. In programming language terms, it is dynamically and weakly, basically duck-typed.
sometimes convenient, sometimes weird
  • no row-level locking (but then, that's not commonly the largest scaling issue)
  • no foreign keys (can be worked around with triggers)
  • no permissions (...with the reason that it's fairly pointless if you can read the file anyway)
  • the less-common RIGHT JOIN and FULL OUTER JOIN are unimplemented (verify)
  • triggers are somewhat limited (verify)


String or BLOB exceeded size limit seems to mean that one or both of:

  • a query size exceeded SQLITE_MAX_SQL_LENGTH (1MB by default)
you can avoid this by using parametric values (binding) instead of literal query strings
  • a row, text field, or blob field exceeded SQLITE_MAX_LENGTH (1GB by default)

(possibly other limits)

If course, you can set these these while compiling. SQLITE_MAX_LENGTH may be 2^31-1, and SQLITE_MAX_SQL_LENGTH may be no larger than SQLITE_MAX_LENGTH (and possibly also no larger than 1GB?(verify) -- not that wanting a 2GB query is very sane...

Array oriented file databases


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

CDF (Common Data Form) is a scalable, primarily array-oriented format, which seems to have been designed to use and to archive large amounts of sensor data, with it being self-described, appendable, with one writer and multiple readers. bears some similarities to HDF5 in intent, and is found in similar contexts.

In fact, NetCDF4 chose HDF5 as a storage layer under the covers, so libraries could easily deal with both CDF and HDF5 (though CDF data representation, especially classic, is more restricted in practical use).

In different contexts it may refers to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.

This is a little confusing, in that some of them have less constant over time. In particular, the data format for netCDF can be roughly split into

  • classic (since 1989)
  • classic with 64-bit offsets (since 2004)
  • netCDF-4/HDF5 classic (since 2008)
  • netCDF-4/HDF5 (since 2008)

The first three are largely interchangeable at API level, while the last, also called enhanced, allows more complex data representations that cannot be stored in classic.

See also:

And perhaps

Hierarchical Data Format (HDF)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Now typically meaning HDF5 (though HDF4 still sees some use).

Describes a data model, library, and file format.


  • hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it
  • Stores various array-like stuff, and halfway clever about readout of parts from huge datasets.
  • structured, potentially complex data
primarily cases where you have many items following the same structure, and often numerical data (but can store others)
think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays
  • fast lookup on predictable items
offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.
a little forethought helps, though
  • portable data (a settled binary format, and self-described files)
  • dealing with ongoing access to possibly-growing datasets
  • parallel parallel IO has been considered
multiple applications accessing a dataset, parallelizing IO accesses, allows is use on clustered filesystems

See also

Unsorted (file databases)

Other/similar include:


Tokyo (and Kyoto)

Tokyo Tyrant / Kyoto Tycoon
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

database server, so useful when you want some concurrent use.

Supports expiry, so can act much like memcached

Tokyo Dystopia
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

full-text search system

Tokyo Promenade
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

content management system built around cabinet.

Presentable in BBS, blog, and Wiki style

Storagey stuff - kv, document, and wide-column style

MongoDB notes


  • Weakly typed, document-oriented store
values can be lists
values can be embedded documents (maps)
  • searchable
on its fields, dynamic
- which together specifies a single query operation (e.g. it always sorts before limiting [9])
supportable with indices [10]
field indexes - basic index
compound indexes - indexes a combination, e.g. first looking for a userid, then something per-userid)
multikey indexes - allows matching by one of the values for a field
2d geospatial - 'within radius', basically
text search
indexes can be:
hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)
partial index - only index documents matching a filter
sparse index - only index documents having that have the field

  • sharding, replication, and combination
replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.
  • attach binary blobs
exact handling depends on your driver[11]
note: for storage of files that may be over 16MB, consider GridFS

  • Protocol/format is binary (BSON[12]) (as is the actual storage(verify))
sort of like JSON, but binary, and has some extra things (like a date type)
  • Not the fastest NoSQL variant in a bare-metal sense
but often a good functionality/scalability tradeoff for queries that are a little more complex than just than key-value
  • no transactions, but there are e.g. atomic update modifiers ("update this bunch of things at once")

  • 2D geo indexing
  • GridFS: chunking large files and actually having them backed by mongo.
point being you can get a distributed filesystem

  • mongo shell interface is javascript

Schema considerations

  • One to one relationships? You probably want it in the same document. For most It saves extra lookups.
E.g. a user's address
  • One to many? Similar to one-to-many.
E.g. a user's address when they have more than one.
  • Always think about typical accesses and typical changes.
For example, moving an entire family may go wrong because values have to change in many placers. (but then, it often might in RDBMS too because a foreign key would have to change in many places)
  • foreign-key-like references can be in both places, because values can be lists and queries can search in them
Usually, avoid setups where these lists will keep growing.
  • when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.
  • sometimes such denormalized information can actually be a good thing (for data-model sanity)
e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.
  • if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)

Also there's a document size limit

You can set _id yourself, but if you don't it'll get a unique, GUID-like identifier.

See also:

GUI-ish browsers:

riak notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Key-value store with a focus on concurrency and fault tolerance,

Pluggable backends, e.g. allowing use as just a memcache, or give it persistence.

Eventually consistent (with some strong-consistency experiments?(verify))


Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)

Backends include

  • bitcask
all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)
file copy = hot backup (verify)
  • leveldb
keys stored on-disk
secondary indexes, so limited limited relational-style querying at decent performance
data compression
no hot backup
  • innostore
  • memory
objects in ram

It aims to distribute perfectly (and supports some other features by assumptions), which implies you have to fix your cluster size ahead of time

Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache


Key-value store, distributed by using Raft consensus.

It is arguably a Kubernetes / Google thing now and has lessening general value, in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.

CouchDB notes

(not to be confused with couchbase)

Document store.

Made to be compatible with memcachedb(verify), but with persistence.

  • structured documents (schemaless)
can attach binary blobs to documents
  • RESTful HTTP/JSON API (to write, query)
so you could do with little or no middle-end (you'll need some client-side rendering)
  • shards its data
  • eventually consistent
  • ACIDity per doment operation (not larger, so inherently relational data)
no foreign keys, no transactions
  • running map-reduce on your data
  • Views
best fit for mapreduce tasks
  • Replication
because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness
which is nice for merging updated made remotely/offline (e.g. useful for mobile things)
and don't use it as a message queue, or other things where you want these guarantees
  • revisions
for acidity and conflict resolution, not in a store-forever way.
An update will conflict if someone did an update based on the same version -- as it should.
  • Couchapps,

document ~= row


  • view group = process
nice way to scale
  • sharding is a bit harder


  • not in views
  • if large, consider CDNs, a simpler nosql key-val store, etc.

See also:


Javascript analogue to CouchDB.

Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.

Couchbase notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(previously known as Membase) (not to be confused with CouchDB)

CouchDB-like document store, plus a memcached-compatible interface

Differences to CouchDB include:

  • typing
  • optional immediate consistency for individual operations
  • allows LDAP auth
  • declarative query language
  • stronger consistency design


Distributed key-value with optional transactional API

See also TiDB, which is basically an SQL layer on top that makes it NewSQL.


Mixed-model but seems built on (ordered) kv.

Serializable isolation


Distributed key-value store, a fork of Google's LevelDB


Column store


key-value or document store.

stronger consistency guarantees than many other things

supports transactions on more than a single object - making it more ACID-style than most any nosql. By default is looser, for speed(verify)

very interesting performance

optionally applies a schema

Most parts are BSD license. The warp add on, which provides fault-tolerant transactional part, is licensed. The evaluation variant you get omits the fault-tolerance guarantee part.


Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.

Storagey stuff - wide-column style

As wide column is arguably just a specialized use of a kv store


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Distributed wide-column store

Uses CQL for its API (historically that was Thrift)

Seems to scale better than others, though apparently at cost of general latency


People report (cross-datacenter) replication works nicely.

Values availability, with consistency between being somewhat secondary NOTSURE (verify)

(if consistency is more important than availability, look at HBase instead)

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

An implementation imitating Google's Bigtable, part of Hadoop family (and built on top of HDFS).

See also:

  • stored on a per-column family basis


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

See also:


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Storagey stuff - graph style

This one is mostly about the way you model your data, and the operations you can do, and do with fair efficienct. that you can use e.g. key-value stores in a graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.


Apache Giraph







Cachey stuff

redis notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


  • key-value store
  • primarily in-memory
though allows persisting by
logging requests for recovery that way (AOF), and/or
snapshotting current state, e.g. every hour (RDB)
...but think hard before considering it a primary store
  • typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray
that you get a basic set of queries/operations for
  • geo data types
  • can be sharded [13]
in-client mapping (so must be consistent if you use it as a store rather than a cache!)

  • allows transactions, which lets you do your own atomic updates when necessary
  • pub/sub


  • memcache of sorts (session data, common resources)
  • intermediate between services (see pub/sub)
  • rate limiting between

redis is also sometimes used like a persisting sort of memcache (serializing objects into strings instead of using data structures)

It is aimed at structuring well enough that all operations can stay simple and fast, which is roughly why it scales well.

Complex data structures is something that takes some informed designing.

Not really made for finding items by anything other than their id, so indexing is also something you would design yourself. [14]

On commands:

  • most/all
SETNX - set only if did not exist
SETEX - set value and expiration
  • bytestrings
GETRANGE (substring)
  • bitstrings
  • integer / counter
  • float value
  • list
LRANGE (fetch by index range)
LTRIM, RTRIM (throw away al)
BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)
RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues
  • hash
HMSET, HMGET (multiple)
  • transactions:
MULTI, ended by either EXEC or DISCARD


memcached is a memory-only key-value store

  • evicts items
by least use when memory limit is hit (LRU style, geared to keep most-used data)
by explicit eviction time, if given
  • client can shard to multiple backends - an approach that has some some footnotes
  • no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency

It is not:

  • storage. It's specifically not backed by disk
  • a document store. Keys are limited to 250 characters and values to 1MB
if you want larger, look at other key-value stores (and at distributed filesystems)
  • redundant
You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)
  • a cacheing database proxy
you have to do the work of figuring out what to cache
you have to do the work of figuring out dependencies
you have to do the work of figuring invalidations

Originally developed for livejournal (by Danga Interactive) and released under a BSD-style license.

See also

Searchy stuff


See also Lucene and things that wrap it#ElasticSearch

Note that it is easy to think of Lucene/Solr/ES as only a text search engine.

But it is implemented as a document store, with basic indices that act as selectors.

So it is entirely possibly to do more structured storage (and fairly schemaless when ES guesses the type of new fields), with some arbitrary selectors, which make it potentially quite useful for for analytics and monitoring.

And since it adds sharding, it scales pretty decently.

Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.

Time seriesy

Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.

They are aimed at being efficient at range queries,

and often have functionality that helps


Is now part of a larger stack:

InfluxDB[15] - time series database
Telegraf[16] - agent used to ease collecting metrics. Some pluggable input / aggregation/ processing things
Kapacitor[17] - streams/batch processing on the server side
Chronograf[18] - dashboard.
also some interface fto Kapacitor, e.g. for alerts
Often compared to Grafana. Initially simpler than that, but more similar now

Flux[19] refers to a query language used in some places.

InfluxDB can be distributed, and uses distributed consensus to stay synced.

Open-source, though some features (like distribution) are enterprise-only.

Data model notes
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

In comparison to a more relational view...

  • database is a logical container for
retention policies
time series data
continuous queries

  • retention policy (RP) contain:
replication factor (copies kept in the cluster) (autogen's default is 1)
retention - how long to keep the data (min 1h) (autogen's default is infinite)
shard group duration - how much data is stored in shards (min 1h) (autogen's default is 7d)
measurements - each measurement is implicitly part of the retention policy you put it in
each database can have one or more RPs
you get a default called autogen (defaults mentioned above)
you'll quickly notice them in addressing (testdb.autogen.measurementname) though can ignore everything about them at first

  • measurement are like a table, containing tags, fields and points
  • series
basically refers to a (measurement,tag) combination you'd likely use in querying -- see below
  • tags are key-value pairs (column-like) you can add per datapoint that are
part of a series' uniqueness, and is indexed
basically, whatever you need for lookup
limited to 64K (and you probably don't want to get close without good reason)
  • field are key-value pairs (column-like) you can add per datapoint that are
not part of its uniqueness, not indexed
basically, any values you may be looking up
types include float (32-bit), integer (64-bit), boolean, timestamp, or string
a measurement takes the type of the first value it gets (and currently cannot be changed except with some creativity[20]), so e.g. forcing integer (add
) over float is sometimes necessary, e.g. to store large values without losing precision
keep in mind that a field will not take different types over time, even if it might be fine, so being consistent per measurement is a good idea. You'll see errors in the log like
field type conflict: input field "ping" on measurement "net" is type integer, already exists as type float
strings possibly not limited to 64K? (I've seen conflicting information)
but you probably don't want to use influxdb as a blob store if you want it to stay efficient
you can check current type with show field keys
  • (data) points

always have a time(verify)
there is always an index on time
time precision is another detail

Typical use of measurements, series, tags

Say you want to start keeping track of CPU use and are collecting it for various datacenters (various tutorials use an example like this).

You might have a

  • specific database for this purpose (for admin reasons)
  • retention policy mostly because you want monitoring stuff deleted after a year without thinking about it
  • measurement called host_monitor

and want to enter a datapoint with

  • tags like hostname=node4,datacenter=berlin,country=de
  • fields like cpu0=88,cpu2=32

You'll notice this is a pile of everything CPU-related.

The tags are structured with common uses in mind, often the coarsest and finest things you anticipate querying on - you can e.g. pick out and e.g. average per country, or pick out a particular host if needed (hostnames only unique within datacenters - i.e. combined with other tags).

Series are basically the idea that each unique combination of (measurement,all_tags) represents a series.

Data you send in from different places will often imply unique series, through having unique tags, though to some degree they are more of a querying concept, and a storage one only insofar that the indexing helps that.(verify)

On point uniqueness

A point is unique (measurementname,tagset,timestamp), so if you write when a record with that tuple already exists, field values are merged/overwritten.

Depending on the timestamp precision you hand into the ingest url, this

  • may never happen, if it's based on 'now'
  • may never be an issue, if the precision is much higher than the interval you send in
  • may be something you do intentionally

On timestamp precision

Timestamps are currently nanosecond resolution by default. This can be reduced to microsecond, millisecond or second.

Lower-precision timestamps lead to

  • overwrites data with the same timestamp (see previous point)


  • Is that per database, series, datapoint at insertion time?
  • Does it mix precision if you alter precision over time?


/query queries, management /write ingest, takes line protocol

/ping health and version

/debug/pprof Generate profiles for troubleshooting /debug/requests Track HTTP client requests to the /write and /query endpoints /debug/vars Collect internal InfluxDB statistics

The line protocol[21] is a one-liner text presentation that looks like

measurement,tag_set field_set timestamp


tag_set and key_set are comma-separated key=val pairs
timestamp is nanosecond-precision Unix time
(also optional; defaults to local timestamp, UTC, but be aware of
clock drift (so you likely want NTP)
timezones (so have a serious think about using either client time or server time))

Clients may ease conversion of structured data to line protocol.

On shards and shard groups

Under the hood, data is stored in shards, shards are grouped in shard groups, shard groups are part of retention policies.

This is under-the-hood stuff you don't really need to know, though it may be useful to consider in that shard groups are related to

  • the granularity with which old data is removed because of retention policy (it's dropped in units of shard groups - so never immediately)
  • efficiency of the typical query, e.g. when most queries deal with just the last one and the rest is rarely touched, essentially archived and rarely or never touched by IO
Config notes

Logging is great for debugging, but is both verbose and (unless all its clients POST instead of GET) puts possibly-sensitive information in system logs.

In which case you probably want to set log-enabled = false in [http]

Note that if you put a HTTP server in front, the same may apply there. If apache, look at something like

 SetEnvIf Request_URI "/submit" dontlog 
 CustomLog /var/log/apache2/access.log combined env=!dontlog

Security notes

Host security

Because it's designed to be clustered, it serves on all interfaces by default (and names should be resolvable).

On a single-node installation you could choose a localhost-only via bind-address in the [http] section, which you'd want as

bind-address = "localhost:8086"    # default is ":8086"

(There used to be an admin web interface on port 8083, but this has been removed[[22]]. You now probably want to use Chronograf)


For similar reasons, by default there is no authentication[23]

you may wish to firewall things at IP level
if you want auth, you need to enable security and create users (see below)
auth happens at HTTP request scope, e.g. for the API and CLI
certain service endpoints are not authenticated
  • can do HTTPS itself [24]
you can get server certs and client certs - and use self-signed ones if you wish
note that in a microservice style setup, you may wish to do this on the edge / ingest sides instead

User authentication

Enable: set
in the
section and restart

You can

  • use basic auth
  • hand in
    sername and
    assword in the URL or body
  • JSON Web Tokens

If nonlocal, it's recommended you use HTTPS, because all of these options are effectively plaintext.

User authorisation

New non-admin users have no rights. They can be given

ALL (meaning READ and WRITE)

per database

New admin users have a lot more granularity, like

  • user management
Querying notes

Query languages

InfluxQL - an SQL-like language [25]

Flux - a more featured language [26]

InfluxQL examples (you may want to run the cli, e.g.
influx -precision rfc3339
where that argument is for human-readable time formatting)
USE testdb

A simple query would be

SELECT "eth0_rx", "eth0_tx" FROM "pc_monitor"

would be all data of that series that we have

Queries like

SELECT "cpu_used" FROM "pc_monitor" WHERE time > now() - 15m

but when you want a timeseries you often want to regularize it like:

SELECT mean("cpu_used") FROM "pc_monitor" WHERE time > now() - 15m GROUP BY time(1m) fill(null)

This adds a time interval (1m), what to do with multiple values (aggregate into the mean)

On fill(): GROUP BY time() creates regular intervals(verify), so it has to do something for intervals with no data. Options:

null: return timestamp with null value (default)
none: omit entry for time range
previous: copy value from previous time interval
linear: linear interpolation


Getting the most recent value


SELECT last("cpu_used") FROM "pc_monitor" WHERE time > now() - 1h


last() aggregate is what it sounds like
you want a time limit, to avoid selecting the entire time series for that to
you probably want that anyway, when you care to view something current

For a gauge, you may want a recent average, like:

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) ORDER BY time desc


  • because this compares against now, the most recent interval that GROUP creates may not have a value in it yet, meaning you'll get

SELECT LAST(eth0_tx) from pc_monitor

SELECT LAST(field_name), * from test_result GROUP BY *

GROUP BY * effectively separates by series

SELECT LAST(*) from pc_monitor group by *

Practically similar to

SELECT * FROM "pc_monitor" WHERE time > now() - 15s ORDER BY time desc limit 1

Though you may like some averaging, like

SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) fill(none) ORDER BY time desc limit 1

Note that without the ORDER BY time desc limit 1 you'ld probaly get two time periods (at least until the group time is at least twice the selection time)

Dealing with null

See also:

Management notes

Deleting data

If this is about removing too-old data, the never-think-about-it approach is to set up retention policies.

...but yes, you can do things like:



all data and series from a measurement
DROP SERIES FROM "net" WHERE hostname='myhostname'


drops all series that apply

DELETE FROM "net_monitor" WHERE hostname='myhostname' and time < now() - 1h


the delete granularity is effectively measurements (not tags) (verify)
this won't delete the series, even if it removes all points

DROP SHARD shardid


you'd probably get the shard id from
show shards

Browsing data

Use the CLI, something like something chronograf or grafana.

There used to be an interface [ this?]

CLI example:

> use foo
Using database monitor
> show series

Backup and restore

influxd backup -database name -portable

Storage size

Because of the compression done to older data, and the often-quite-compressible nature of time series, most monitoring needs don't really need to worry about space use.

This of course does scale with the amount of counters, and the time resolution of insertion.

For example, in a 70 day test with dozens of counters inserting in intervals between 2 to 300 seconds, space sawtoothed (because the compression is staged) up 200MB, from 500ish to 700ish.

Not nothing and on embedded you probably still want rrdtool, but also nothing to worry about on, say, raspberries or small VPSes.

Chronograf notes

Separate install and binary, so needs to be pointed at a InfluxDB instance


Tied to Postgres



See Data_logging_and_graphing#Graphite_notes

Storagey stuff - NewSQL

NewSQL usually points at distributed variants of SQL-accessed databases.

So things that act like RDBMSes in the sense of

giving of SQL features (not just a SQL-looking query language),
providing transactions,
and as ACID as possible while also proviiding some amount of scaling.

NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"


  • distributed SQL
PostgreSQL API
Cassandra-like API


MySQL interface
tries to be OLTP and OLAP
handles DDL better?
in spired by Google Spanner
see aksi TiKV - TiDB is basically the SQL layer on top of TiKV (verify)



Unsorted (NewSQL)

Aurora (Amazon-hosted only)

MySQL and PostgreSQL API to something a little more scalable than those by themselves


basically BigTable's successor


See also: