| Pieces of larger or complex systems
| Database related
- 1 Describing database systems
- 1.1 On database types
- 1.2 Data consistency and versioning, and its concepts
- 1.3 Broad qualifiers
- 2 Some such databases, by type
- 2.1 File databases
- 2.1.1 key-value file databases
- 2.1.2 Relational file databases
- 2.1.3 Array oriented file databases
- 2.1.4 Unsorted (file databases)
- 2.2 Storagey stuff - kv or document stores
- 2.3 Storagey stuff - wide-column style
- 2.4 Storagey stuff - graph style
- 2.5 Cachey stuff
- 2.6 Searchy stuff
- 2.7 Time seriesy
- 2.8 Storagey stuff - NewSQL
- 2.1 File databases
Describing database systems
On database types
Block stores are access to one big series of bytes.
In moderately-sized blocks at a time, not byte level.
May just mean it's exposing a large disk at low level - SAN tends to mean storage exposed this way.
In cloudy setups, it tends to mean 'connect a disk, let us figure out the details' (which may be SAN style under the hood).
The most 'do everything yourself' variant of storage.
This is mostly relevant as an underlying/intermediate layer, e.g. VMs may want their storage on a block device, though even that is usually handled through some existing logical volume management.
There are almost no use cases where users or services want a block device directly.
Object stores often mean storing arbitrary-sized blobs of data, under arbitrary names.
A blob store, today, usually refers to object stores -- though also, 'Binary Large OBject' still has relation to doing this in databases where it was perhaps less of a fit.
Like a key-value store in many ways:
- very basic basic operations: mostly store, fetch, and delete, and maybe list
- no structure to the names
- object stores may allow large objects (gigabytes or more), kv may limit values to a few kilobytes
- roughly so that you use kv for short term, OLTP-ish stuff, and object stores for bulk storage and archiving stuff
- object stores are more likely to have list operation, where kv may not
- object stores usually let you separately fetch metadata
- object stores may have access control per item (and per moderate unit), kv often has none
- object stores may avoid transactions (there's typically no need), and be more eventual-consistency, particularly in designs that want to scale well
- as may kv.
For example, S3 is an object store much more than a kv store - though .
It's worth pointing out that a filesystem is a database.
Code (usually kernel code) manages all the data structures, it has strict semantics that ensure things happen in an atomic, consistent way.
...well, sort of. It turns out it's really a relatively poor database, for a slew of reasons.
- many things are not transactional, meaning you need a lot of error handling
- A treewalk can find itself erroring out on something that it was just now told existed.
- This is where NoSQL wouldn't help much either - they rarely even try or pretend to be transactional.
- Tree restrictions that are hard to keep
- say, filesystems tend to have a limitations like '255 characters in a filename' and '32667 characters in a path'.
- so what happens when you're a few bytes under the latter limit, and you move a directory into a subdirectory so that directory doesn't break that limit, but files in it now do?
- if that move is just "directory parent is now this other directory", this is often not checked
- which means it will work until some API calls break
- but if there is another, it might not.
- the errors tend to be quite unhelpful
- Text encoding is usually... breakable.
- In bytes-based unices, which are conventionally UTF-8, it's easy to get invalid byte sequences.
- In windows, which are mostly UTF-16, it's entirely possible to get invalid sequences.
- Languages that want to interface with both linux and windows have to do even more creative things
- I have one file on my windows drive that I haven't been able to delete in years. Explorer cannot do it. Probably something can, but I'll be damned if I can find even the reason it exists this way
As such, APIs that want to be portable have to wrap two different existing APIs creativity, and somehow deal with the fact that the underlying filesystem might well allow incorrect data.
- Tree structure can be weird
- for example, symlinks and links may mean that what you previously thought was a tree is actually an acyclic graph or even cyclic graph instead.
- the last meaning that treewalking is technically infinite, except it will error out on length at some point
- there are some practical cases of unnecessarily bad algorithmic complexity
- consider e.g. an ls that goes through items and stat()s each, when that stat happens to be implemted in a way that makes it linear-time lookup. That makes a simple directory listing quadratic time with the amount of entries
- and trust me, windows is no stranger to being slow as molasses when viewing more than 20K or so things in a single directory.
The POSIX interface (and the windows one too) is defined very serially.
Unnecessarily so, but thirty years ago it mostly worked fine, and now we can't change the semantics.
which is why distributed filesystems will point out their own, non-standard API.
- metadata isn't centralized / indexed, so most queries are inefficient
It's great for robustness against local errors, sure. And not an issue for many specific accesses. But very poor for others.
Consider that a treewalk on a POSIX style filesystem is tens to hundreds of thousands of syscalls, that also have to happen in absolutely strict sequence because the semantics require it. Not even the much smaller latency of SSDs will help much here.
That said, 'list everything you have' is also something you never want to do with any online stores.
Example: I have one directory structure for a bunch of structured data - 500K files, 19K directories.
When the metadata is not cached in RAM, it takes eight minutes just to walk through and just list them.
- That's just stats, directory-opens, getdents, because that is the only way to fetch that structure
When it is cached, it still takes three minutes
I have that same directory listing mirrored in a database. It's milliseconds in total to do most things I care to do. And a few hundred milliseconds for rare things. Even a total dump only takes seconds.
When you work with things like S3, you'll probably find yourself reflecting how this is sort of file storage in that buckets are like folders, objects are like filenames that refer to arbitrary-sized blobs of data, but without things that can go wrong at directory-structure or weird-filename level.
The old standby.
Also great at structure, generally the best consistency managers we have, and allows things like relational integrity that can actually be enforced.
The consistency is not due to their relational nature, in fact that makes things a little harder. It's just that that was one of their design goals, because they used to be the only real choice in serious database.
That same consistency is what fundamentally puts a limit on their ability to scale. (even loosening the consistency management only makes them somewhat faster. It is their overall model that doesn't scale)
NoSQL scales better in part because its guarantees are simpler, and less strong.
NewSQL is an interesting new balance between the two, but comes with footnotes that you really should know.
...though things like Apache Derby (previously IBM Cloudscape) can be interesting if you want to, basically, embed a small (~3.5MB) RDBMs within the same VM as your Java application. (though it can also be set up client-server style) Or, perhaps more likely, something does this to give you standard SQL OLTP without requiring you to install and set up something more complex.
You ask for a value for a given key. Typically no structure to the data other than your interpretation after fetching it
...so they lie very close to object stores and blob stores (the last basically file stores without filesystem semantics).
When this fits your data and use, these are often low-bother ways to store a whole bunch of information, often with some corruption recovery.
If you have something in a RDBMS where you are actually mostly retrieving by primary key, and doing few or no joins (which may include things like simple data logging), you could use one of these instead.
Disk-backed or not Various file-based stores (see e.g. File database notes) are effectively disk-based key-value stores.
Since they are often indexed, and cached in some way, you may easily think of them as hashmaps that happen to be large and stored on disk.
Used more than you may think; Berkeley DB is used in many embedded setups.
There has been some revival in this area. For example, Tokyo Cabinet is basically a modern, faster reimplementation of the dbm. (with some extensions, e.g. Tokyo Tyrant to have it be networked, Tokyo Dystopia to add full-text search).
Whenever each value is completely independent from all others, you can spread these to roughly any scale you want, by putting then on any amount of machines and not even telling the others about it/
Once you want things like altering multiple things atomically, though, you get one of three answers:
- haha no.
- not with acceptable performance
- 'eventual consistency probably' is good enough, right?
key-value where the value is structured data, often not following a strict schema.
It is also frequently possible to index on these fields
Often presented as JSON (or XML, though XML databases can be considered a specific type of document store).
In contrast with e.g. relational data, documents are often altered individually, delivered as-is, and not heavily linked.
You can often think of these as key-value stores with larger values and a nicer interface to change them.
In that documents are whole things and mostly independent - you get no relations the database knows/cares about, and any indexing mostly just implies a little overhead to your CRUD.
The difference is that it is easier to expect documents to act as rows, and maybe some transactions.
The answer to transactions is still often one of:
- haha no
- not with any speed
- eventual consistency is good enough, right?
- though you can often make all alterations to a single document happen at once
- and some will make that 'eventual' sooner than others
Regular RDBMses store data by rows, which makes sense when rows represent individual things you usually handle one by one.
A column-oriented DMBS basically store columns at a time, which makes sense when you're typically working on columns, e.g. primarily doing aggregations, summaries, statistics, or analycics on what works out as a column at a time (with less focus on data relation and structure).
It also tends not to need indices for these applications.
There aren't really NoSQL column stores I know of, but it's good context for wide column stores.
Wide column store
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)|
A wide column store, sometimes extensible record store, could be thought of a key-value store where you make the key more complex.
This e.g. allows
- modeling a single matrix-like table, sparsely
- modeling EAV-like storage and access (more scalably than putting it in relational)
For example, if you happen to have tabular data, you might use keys like:
(row,col) -> data
If the table is small and you usually need all of it, this would just be a very inefficient way of accessing it
It's also not very flexible about addressing - when you don't and can't actually tell it it happens to be storing a matrix, so it won't be able to tell you its size, or give you any help accessing ranges in it.
- if the table is huge and/or sparse and/or you typically need small fragments at a time
- if you have a simple way to address the parts, or adapt to the minimal tools a particular store does give you
this becomes interesting.
Wide column stores actually vary significantly on how the data model works in practice.
It's an interesting exercise to bolt this onto any key-value store yourself, as you'll quickly figure out what kind of operations are hard to do efficiently or at all.
Then maybe compare with things like Cassandra, which tries to be as table-like and SQLish as it can manage.
You can see this as a data model pasted on top of basic key-value storage, meaning the storage is very easily distributed by nature - unless your data model breaks that.
Each implementation has a data model that add a few properties to the way data is organized, and accessed, to make certain uses more practical or efficient.
In particular, various have the idea of column families (making them more structured than e.g. EAV would suggest)
For example, bigtable chooses
- read/write under a row key are atomic
- data under a row key is ordered
...which helps related sites be on similar nodes, helping access speeds
- columns are grouped into column families
- which makes them more structured under the covers (and helps access efficiency)
Bigtable suggests you store and access things with keys like:
(row:str, column:str, time) -> data
For another example, cassandra has partition keys, clustering columns that helps direct how data is stored.
primary key is partition keys + clustering columns?
From another view, each row still maps to many columns that could be selected. Part of the access efficiency comes from the fact we need only select the columns we are interested in, and this is part of the lookup, not the filtering.
Wide column stores are, by themselves, just extensible record stores, but various implementations add additional properties
Examples: Bigtable (proprietary) HBase Cassandra Cosmos DB (when using its Cassandra API)
Bigtable is a distributed data chunk storage.
It aims to meet a moderately specific sets of requirements, which happened to be useful for google search, gmail, and such.
It aims to serving potentially large chunks of only loosely structured data, with low latency where possible/necessary.
It actually is a moderately specific a data/access model, with ordering that allows for a B+-tree-like distribution model across various nodes
The data model is described in the whitepaper as "a sparse multi-dimensional sorted map"
- In other words, a map.
- With some fancy implementation details.
In terms of uniqueness, the map is from a 3-tuple:
- row key (string),
- column key (string), and
- timestamp (int64),
to a string value
- row key (string)
(row,column,timestamp) -> value )
(Note that to Bigtable itself, these values are uninterpreted bytes. It is up to clients to give it a meaning, as necessary, e.g. serializing )
This is not the way you access Bigtable, and most real-world applications of Bigtable are neither quite like a map or like a row-column table. Both of those can be useful assisting mental models, but you are likely to use it quite differently. Note that columns are weak versions of RDBMS-style columns at best.
(Note that you can use the bigtable setup in more than one way, because there is no predefinement or enforcement type of schema, only one implied in use.)
Consider for example the white paper's example is to store, for each URL, the page contents and sites that linked to this one (data you might want want for some particular linkage experiment or so, and would generate from crawl data).
For example, if foo.example.com and bar.example.com have links to www.example.com, you would have stored items probably at least including the following:
('www.example.com', 'anchor:foo.example.com', 0 ) -> 'link name used from foo' ('www.example.com', 'anchor:bar.example.com', 0 ) -> 'link name used from bar' ('www.example.com', 'contents:', 0 ) -> 'whole_html_page' ('foo.example.com', 'contents:', 0 ) -> 'whole_html_page' ('bar.example.com', 'contents:', 0 ) -> 'whole_html_page'
In this case, anchor is the column family you are using to map from link URL to link name.
In comparison to RDMBS, you would have a few options, and only adding an anchors table to the schema is even vaguely sensible (a compounded list in a single column would not scale, and a column per site would be crazy because of the way that is stored)
In comparison to maps, well, you can indeed use a basic a map like the above, creating a key single key from the elements above.
Except, of course, Bigtable rows and columns have meaning, in terms of querying, management, access control, (and also in implementation details, such as that compression happens per column family), and such.
Mutations happen atomically at row level. You can open a row as if it were a record, and add/alter/delete columns from it, then apply (store).
Often used with
that imitates a row-column setup at scale (and often with column sparsity)
- (with the implication that each cell can be formatted differently, but you usually don't want that)
Depending on how you use this, you could store something imitating tables, rows, and columns, but implements it as a variant on a key-value store.
- unlike classical databases in that it doesn't store rows as units,
- unlike column stores in that it also doesn't store columns clustered
- though wide-column can be hinted to do that if possible
- sometimes lead to very compressible storage
- you may have higher per-cell overhead, but also be fairly predictable
- for extremely large stores, this scalability will often make up for (for reasonable queries)
- the column family stuff (hinting and grouping) can make a lot of difference
- which is great if you know how to use it
- and less great if you don't
- at smaller scale, it tends to be less efficient.
- join and queries involving multipe attributes are often less efficient
- writes may be less efficent
- Consider e.g. that updates for multiple things in a 'row' has to go to ten different places, and in a classical database just to one.
So arguably, this is more of an OLAP than OLTP thing.(verify)
A graph in the mathematical sense, of a node-link structure.
Certain types of data are very handy to see this way - sometimes because it is a good fit, sometimes because it is structured but still free-form (so you don't end up with 100 relation tables and a headache of joins, and can be much faster when you typically look for very small parts of that data)
While generally considered an index based on a primary data store elsewhere, that also makes that searchable index the thing you actually use.
Plus there are projects that do both.
Message queues are useful to send messages when more than two endpoints are involved.
Not a storage model, or NoSQL, they just turn up in the same places because they are really useful tools when you're scaling things up, be it around clusters/swarms, whether microservice style or not,
Also, some of the newer implementations made it a it a lot easier to interchange not only between threads and processes, but also between hosts and even datancenters with less work.
Details beyond that vary a lot, like
- delivery model
- ordered (e.g. fifo) or not
- single consumer or broadcast
- one-way or two-way
- awareness of balancing
- whether things are removed only once acknowledged (useful for job queues)
- whether there are priorities in the queue
- whether it's guaranteed low-latency or not
- whether delivery is guaranteed (and whether you get a list of things that failed to deliver)
- whether messages are federated
- whether messages are kept temporarily or forever,
- whether it's working from a central broker or not,
- whether it's guaranteed low-latency or not
- whether security is a concern
Data consistency and versioning, and its concepts
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)|
ACID (Atomic, Consistent, Isolated, Durable) is a set of properties you want to guarantee that the data it shows you is correct and consistent and up to date at all times. 'ACID compliant' an be intuited as 'a guarantee that my data is safe and correct'.
For a large part, that is about how a sequence of interactions is handled, in particular
- dealing with sequences of statements that must be applied all together, or not at all: transactions
- parallel execution
- because most real practice has more than one user
- because it make most parts a lot harder (arguably, there should be a P in the acronym, because parallelism is usually the part that introduces the most complexity, edge cases, and tradeoffs, and is sort of important to understood well)
- roughly: when the backend acknowledges a request, this change will survive, and you'll get it back as-is, even around various typical malfunctions
- in part a 'guaranteed to be actually written to disk', in part about recovery, in part about ensuring other ACID parts during recovery
- and implies implementation details like "write a new copy, then retire the old one" style journaling (see also e.g. WAL) is a popular solution - instead of e.g. "update/overwrite the existing data as you go" because while that can often be a little faster, failure halfway through might make that unrecoverable)
- roughly: data respects all imposed integrity rules (constraints, cascades, triggers, etc.) at all times
- attempting a operation that isn't allowed will be rejected, and will never lead to showing invalid state to clients at any time
- if the transaction turns out to violate a constraint by one of its later actions, it must be rolled back
- this is basically impossible to do without atomicity and/or isolation
- roughly: a set of changes either happens fully, or not at all
- since many operations in a transaction aren't even individually atomic, this certainly requires extra bookkeeping (e.g. of before and after states)
- the easiest method is locking, but this also puts a very hard limit on concurrency. Concurrent methods are more involved.
- transactions should not interfere with each other when concurrent
- ongoing interactions are invisible to each other; even when actually handled in parallel, they should look to any client as if they were done sequentially; what one transaction alters is hidden from all others
- largely means they cannot react to new state from another transaction
- required for consistency
- this conflicts with concurrency (more fundamentally than some think - varying somewhat with how complex the data structures are)
- (at the very least, things become more complex when the data they deal with overlaps in any way)
Even with those words, and many more from the articles they came from, ACID still remains a fairly open-ended description, not a set of rules of how to implement that.
And if some of these seem to overlap, that's because they do.
Which leads to arguments like "is what isolation says already covered by atomicity?", so it's probably there because we planned to relax it from day two.
In fact, it seems to have been added later indeed;
- ACD from J Gray (1981) "The Transaction Concept: Virtues and Limitations"'
- ACID comes from T Haerder and A Reuter (1983), "Principles of transaction-oriented database recovery"
One way you can see the A/I difference is that
- atomicity is about all of a transaction's writes becoming visible atomically as table state
- isolation about different reads in another transaction not seeing different versions at different times within it
- one focuses more on correctness, the other more on ordering
- consider you could write a system that switches in one transactions's new state perfectly atomically, and have another transaction read from the earlier state at one time, and the new state later (Nonrepeatable reads, see below)
These details is also one reason SQL defines isolation levels, which actually relax from strict ACIDity.
So yes, real-world relational databases are typically not configured to be fully ACID (and a few can't even be).
This is also roughly why a lot of NoSQL often does not care about strict transactional correctness.
ACIDity of RDBMSes versus NoSQL
ACID is historically largely related with RDBMSes.
Not because ACID means RDBMS - ACID is just a set of principles, without even saying how you'ld implement them.
But because RDBMs's features means that without considering ACID in your design, you'll make a mess.
NoSQL is different. The same relaxation of the of the data model that allows it to scale also lowers the amount of guarantees - that even have to exist.
With no enforced relations, and generally no transactions,
- Atomicity is almost implied if you are restricted to only 'get' and 'put' single items.
- and it is generally not guaranteed on bulk operations. They'll say something like 'best effort'.
- Isolation is irrelevant (no transactions)
- Consistency has almost no rules to adhere to (beyond rejecting nonsense requests)
- (Durability is a separate chat)
So ACID is sort of... irrelevant on most NoSQL.
To even name it suggests you consider it to mean "operations probably not incorrect" rather than transaction correctness (which NoSQL generally doesn't even try)
- ACIDity of single-row (or key) alterations and
- ACIDity of multi-row (or key) alterations.
In RDBMSes, multi-row ACIDity is required to be taken remotely seriously.
Most NoSQL isn't multi-row ACID because there are no multi-row alterations in the first place. You often don't get to group alterations in transactions that get applied atomically.
So they're ACID because they're single-row ACID.
The issue comes in when they call themselves ACID because they're single-row ACID, but they also happen to allow multi-row updates.
If you do that from your code, the result is not ACID overall. You don't get that guarantee.
And we would really like multi-row updates. Particularly around denomralization, it would be great if we can say "Update this related set of denormalized calculated fields, either all or none, because that means the app won't show something inconsistent". By and large, we cannot. We are running either on 'hope for the best', or something inbetween because some BASE implementations are pretty neat.
But technically we weren't lying.
Also, Good luck knowing for sure.
The D is sort of a separate case.
Durability doesn't mention time.
In RDBMSes, it often means that 'if I acknowedge, it means it's guaranteed to be actually written to disk'. Be it tablespace or recovery log, you'll see this data back unless something goes unusually wrong.
NoSQL is split between
- Write-Ahead Logging like that, and
- "sure, we save our in-memory state to disk every few minutes"
- (and a few "neh, we're in-memory only", which is fair)
And in a lot of use, it's a fine tradeoff. Just know that you're making it, and that this might be a bit of a stretch of D.
Aside from NewSQL, there are few NoSQL cases that try to give you multi-row ACIDity (or detail the cases).
You can ask "Is S3 ACID?" and get completely different answers, including
- basically yes, the individual operations are very predictable
- It only does get, put, list, and delete.
- get and put of completely independent items barely gets into any consistency.
- list may not reflect the objects present a second later, but you would only expect that if there were transactions, which there are not.
- No. It's not a database, it's an object store. Your question doesn't make sense.
- No. It does not give read-after-write consistency
- (this changed in 2021, basically after 15 years, and still seems slightly qualified)
- Maybe; you could add something like transactions on top,
- but it'll probably work out as batches and fallback cleanup. Close enough, right?
- You still don't get multiple such interactions working in a serialized way. No I or A, really.
- But hey, you generally wouldn't expect to.
Also, if you use S3 as a backend to a metadata store elsewhere - which is common and convenient - that's even less likely to stay consistent with each other over time.
|This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)|
ACID is great theory, but it turns out that most parallelism will easily be faster only under some conditions, and in various real-world ones things can still come down to "do things largely in series, or it becomes way too complex to handle".
So another question pops up: Do you want a tradeoff between which guarantees we make, for extra speed?
Isolation levels are a controlled way to loosen strictly ACID behaviour in RDBMSes.
Roughly speaking, isolation levels control what sorts of non-ACID behaviour is allowed whenever you don't do explicit, fairly complete locking.(verify)
Three things you want to avoid (at least, the most mentioned and relevant):
- Dirty reads - refers to fetches that ignore any transaction or locking
- Means we don't have to wait at all - great for speed!
- May e.g. mean one connection may read data that is not yet committed by another
- when multiple clients are writing relational data that has to make coherent sense, you can assume this will go wrong at some point
- Nonrepeatable reads: getting different values for the same row when trying to read it multiple times in the same transaction
- ...e.g. because of another transaction's committed alterations. Reads can be non-dirty, but still non-repeatable.
- We usually care about nonrepeatable reads refer to those within the same transaction. There is also the concept of "nonrepeatable within the same query". The fact that we use the same word for both is confusing, yes.
- Phantom behavior / phantom reads refers to transactions 'magically' seeing new rows
- i.e, seeing that the result set changes (rather than that specifically identificable row values change)
- e.g. caused by WHERE clauses that select from another table at different times, which can find new matching rows after they are committed by something else.
- dirty reads are reading non-committed data from another transaction
- nonrepeatable reads are often because another transaction COMMITted an UPDATE
- phantoms are often because another transaction COMMITted an INSERT or DELETE)
'Approximately', because with more scrutiny there are a few more specific cases .
You don't have to think about those terms directly, because SQL-92 defines isolation levels:
- READ UNCOMMITTED
- Basically 'anything you find will do', does not use/observe read locks
- allows dirty read
- allows nonrepeatable reads
- allows phantom behaviour
- READ COMMITTED
- Won't read uncommitted data, doesn't ensure that read rows won't change within the length of this transaction
- avoids dirty reads (only)
- allows nonrepeatable reads
- allows phantom behaviour
- "Won't read uncomitted data" and "allows nonrepeatable reads" seem to conflict. The point lies in the difference between a query and a transaction. Nonrepeatable reads in a query means it will not see different values if it needs to read a row twice. But two identical queries in the same transaction can still see data if another transaction has completely committed data to the relevant table. If you want all state to be repeatable within your transaction, you want either locks, or the following:
- REPEATABLE READ
- Won't read uncommitted data, and does ensure read rows will not change within the transaction. Put another way, you're essentially reading from the same snapshot each time.
- avoids dirty reads
- avoids nonrepeatable reads (the main point, since it's the name of the thing)
- still allows phantom behaviour (...wider locking may avoid it, without being quite the same as SERIALIZABLE. Depends.)
- usually implemented by placing locks on all data relevant to the query
- strictest -- no transaction interaction possible at all
- Tends to be noticeably slower, because it is often implemented with zealous locking, or sometimes even by never handling two transactions on the same object(s) at once(verify).
- avoids dirty reads
- avoids nonrepeatable reads
- avoids phantom behaviour
- Some database engines may offer their own additional levels
- e.g. SNAPSHOT
- oracle and postgres say serializable when they mean snapshot(verify)
- most of these isolations aren't ACID - that's sort of the point.
- Most database engines can be asked to behave to different levels, e.g.
- MyISAM doesn't support transactions at all, so is probably best described as READ UNCOMITTED always
- InnoDB supports all four(verify), and defaults to REPEATABLE READ
- its MVCC nature means it internally has the equivalents of READ COMMITTED and SERIALIZABLE.
- Default is READ COMITTED(verify)
- The other two don't exist, and asking for them actually gives you the next-strictest behaviour:
- READ UNCOMMITTED gets you READ COMMITTED
- REPEATABLE READ gets you SERIALIZABLE
- Note that given its MVCC design, reads tend to spend less time waiting for locks, so looser settings often aren't as useful
- DB2 has its own terms and goes into a few more distinctions and may be interesting reading.
DB2's terms may also be interesting; they go into a few more distinctions
- Repeatable Read (RR)
- Read Stability (RS)
- Cursor Stability (CS)
- Uncommitted Read (UR)
- No commit (NC)
Considerations to allowing dirty reads
Using SET TRANSACTION ISOLATION LEVEL to READ UNCOMMITTED (or e.g. MSSQL's per-query WITH(NOLOCK)), can be used to allow dirty reads to happen, which means the value is read out without locking considerations.
This can be useful when any value currently in there will do. This may be true in the case of social-site profile information, continually updated statistics tables, and such. Any value from now to a few seconds ago will be perfectly good - using locks and listening to transactional accuracy will only really mean more waiting.
Of course, dirty reads may at any time return uncomitted, old, or duplicate data (because you are asking the database to violate ACIDity for the read), meaning you should only do this when you know you'll do no harm, such as when you only read the value to show it and no write/update will even be based on it. (Much the same goes for non-repeatable reads, really)
Note that MVCC-style engines, such as postgresql and MySQL's InnoDB, avoid a lot of classical style locking by isolating data versions, which means there is little speed increase in allowing requests to give you dirty versions.
There are various takes on what happens when systems become scaled up and distributed, in terms of guarantees, and also things like bandwidth and job rate and the things that hold them back, and how such details interrelate.
One of the better known ones, dealing just with the guarantees, is Brewer's CAP theorem, which basically says that of three properties, no matter your answer, you are always choosing a tradeoff.
CAP refers to:
- Consistency usually means "do all nodes see the same data at the same time",
- Consistency in CAP means is mostly more about ordering and timing (this contrasts with ACID, where this concept is more part of atomicity, and consistency in ACID means "follows all the rules and restraints imposed, there often typing, relations, bad values, duplicates, etc.")
- partly "once new state is successfully accepted, we never send older state back" logic, partly agreeing which version is current
- Availability - you will get a response with reasonable time
- a response that is data and not an error
- but there is no guarantee that it is the most recent data
- (note: an answer "regardless of things like node failure" implies that node failure does not halt the system)
- Partition tolerance - system continues even with imperfect communication
- partitioning refers to "parts of a system that don't agree on state"
- dealing with issues like
- "one host fell over due to hardware and then came back"
- "some message got lost"
- and often means the partitioned system needs to be resolved to become one whole again somehow.
- ...automatically if at all possible
- dealing with issues like
These three are entangled to start with.
More so when you add "...within reasonable time", which CAP ignores, and you generally cannot.
The larger the scale and/or the lower the required latency, the more you lean towards accepting you trade away one of these things, in imperfect conditions or at all.
Preferably in a well-informed way.
Some potentially misleading things
- The three concepts aren't quite the same - C and A are about properties that stick around under perfect conditions,
P is specifically about an imperfect condition - partitioning.
- Yet in designs, trying to strongly guarantee two of these comes at the cost of the ability to guarantee the third so well.
- It doesn't become exactly two either. In terms of misbehaviour that you may want, all are a matter of degree.
- The partitioning may be an exceptional thing, or may be a regular thing
- a system may try and manage to be CAP almost always. If it never partitions, there is no good reason to give up CA in regular operation.
Since most distributed systems will try to resolve partitions, and can often do so within reasonable time, this can be made to be fairly rare, at least for everyone not deploying at continent scale.
...historically had no partition tolerance at all. The relational model doesn't distribute, so we tend to make monolithic things that can't partition to start with.
That, and they've been the go to for consistency management.
The ability(/assumption) of communicating perfectly makes it easy to do consistency and availability, as long as the hardware behaves.
So these are classical examples of CA.
You can build something on top of classical databases that does have partition tolerance, but you have to most of the work yourself.
If you assume a system can partition.
Now yes, we do want to know about behaviour when it partitions.
- C+A is a focus for consistency management, lock management, fastish transactioning
- usually doesn't like partitioning - many barely handle re-syncing after a partition.(verify)
- e.g. classical DBMSes
- C+P that under regular circumstances also provides Availability, but when partitioned does not provide Availability
- for large things that act as a primary data store(verify)
- but often basically says "pause until a partition stops happening, e.g. going read-only or unavailable"
- so that it stays deterministic with a simpler model, and more easily rejoinable
- there has been research to make that more flexible
- e.g. mongo, redis, HBase, bigtable, BDB (verify)
- A+P that under regular circumstances can provides C
This isn't a complete view either, because there are lots of gradual inbetweens
(particularly at moderate scale, where resolution may be less troublesome than it is at hyperscale)
- say, some technically fall under one but work better in reality
- A+P may feel scary to do for primary storage
- but consider that, for example, you have heavily nodes, and one falls out. If you have a s
- Say, if all parts of a system can take a good guess at what went wrong, they might be able to
- retry on a different node so that clients don't have to
- inform each other and the client for some automatic reconnects,
- inform the failed node that it needs to update to current state before joining
- reduce the amount of inconsistency to make automatic resolution faster,
leave more of the system in read-write.
Doesn't change the category it's in, but may be a lot better behaved in practice.
In some cases, design choices staple it down pretty specifically.
Various NoSQL is A+P with eventual consistency, often meaning "Yes I updated it locally (and will tell the others as soon as possible)"
In others you can choose one of a a few different points on that continuum. For example, riak's consistency resolution is pluggable, meaning it has a "give me any answer right now" alongside a "do it better within longer".
The example ATMs from the link below is interesting
- you'ld expect it to care about strong consistency
- but that would mean it would pause while partitioned (because you can't know the balance is good)
- in practice, that costs banks revenue, so they basically allow A+P behaviour
- that means you could overdraw in ways you normally can't,
- but this is resolvable in other ways,
- you can limit it by setting the maximum withdrawal lower while partitioned
- so probably fine unless people try to exploit it specifically, and even then the fact that accounts are rarely anonymous so banks have recourse
Sometimes there are tweakable tradeoffs (e.g. between C, A, and P) with settings, or initial-setup choices.
More commonly, it's tied to overall design, and what kind of purpose the system is intended for.
One real take-home may be given that things will go wrong some time, know your actual system's failure modes and how to resolve them
Most people don't seem to, and "the site is down, FIX IT" is not the best time to start learning the theory.
ACID databases focus more on C(onsistency), most people wanting NoSQL focus on the A(vailability), (as well as easier guatantees on time taken when C is about 'is it stored yes/no, don't care about ordering' rather than '...and what about correct ordering, and all its references at all times').
Most systems try to deal with P at least a little, because just because "you disconnected a cable and now everything is permanently broken" is not something you get away with.
That said, there is a difference between dealing with the occasional latency hiccup.
Consensus protocols also rather improve things,
- a very useful thing to have for various things at scale, things like leader elections, atomic broadcasts, multicast, state machine replication (the last being useful to build systems without a single central point of failure)
Consensus protocols/implementations include
Base is a wink at ACID, if much less theoretical .
BAsically available - roughly meaning "won't block much". There will be a response to all requests, but that response is allowed to be
- recent-but-old state (or, when data interrelates, logically inconsistent state).
- 'try again later'
Soft state - you should assume the system might be catching up, even while there is no input.
Eventual consistency: The system will tend towards well-propagated consistency once it stops receiving input.
RDBMSes do what you could now call immediate consistency.
Which basically means that "if I acknowledge an alteration, it's there in my store
Eventual consistency usually means
- "I acknowledge I've received that data and will process it as soon as I can"
This is often before it's written anywhere, so it may hit a conflict, it may be lost in an error.
It usually won't, which is why this is a reasonable cost/benefit thing. The benefit is that we can say yes earlier. The cost is less guarantees.
Eventual consistency also implies data is available in the short run, and consistent only in the long run.
- Eventually Consistent – Revisited
You'll see these two as qualifiers, as kinds of setups, and sometimes as specializations of specific kinds of systems/databases.
- OLTP basically means systems with immediacy, OLAP often means bulk/batch processing
- Systems can be made to do both to some degree, but OLTP often comes with people wanting it to be as immediate as possible
- RDBMSes are mostly (used for) OLTP, NoSQL has less opinion about how it's used
Online transaction processing (OLTP) refers to any system that does day-to-day transactions
- as realtime as possible
- as safely/serially as configured
This usually puts focus on
- operations being simple and fast
- because if things take longer than milliseconds, it's going to scale very poorly, particularly if done serially
- strong consistency
- within systems - where this is simple
- between systems - to the degree this is possible
OLTP that uses more complex data models tends to be implemented in RDBMSes, because the consistency you often want is done for you.
The focus on consistency may mean people defailt to RDBMS anyway, unless they expect scalability issues from that choice.
Online analytical processing (OLAP) tends to refer to "analyse a bulk of data" setups.
This often means that, mostly unlike OLTP
- collecting chunks of data collected over a larger amount of time
- often done for business reasons, rather than direct use by users.
- classically was more about generating reports, analytics, etc. to be looked at later - There are few cases where a response coming seconds later is any problem (or days, if we're talking about reports)
- now more frequently also to be looked at live (scale of seconds), and is starting to overlap with more complex variants of health monitoring
- (compare each to OLTP, where responses are measured in milliseconds)
On mixing the two
When doing OLAP, you could just send queries to your one system that does everything, but when that system is a critiacal, live OLTP system, the bulk nature of OLAP style work could slow that down.
As such, there is a habit to siphon off data from an database used for OLTP to a separate database used for OLAP database (and its own processing host), to work on independently.
There are now also Hybrid transactional/analytical processing (HTAP) is the attempt to do both a transactional database and analytics.
The more at-your-leisure your OLAP work, the more that classic ETL style processing covers you needs, but these days, there are OLAP systems that aim to do real-ish time monitoring, and more explicitly balance scale and speed (also with NoSQL-ish scaling).
- It's often still factors slower than OLTP, or at least behind a few seconds (e.g. when running off a replicated database server)
- yet still many factors faster than generic warehousing, "throw an ETL script at it once every so often", "produce a report" style things),
As a broad qualifier, OLTP and OLAP can describe
- a system as a whole - goals, database, ode, the hosts used
- a type of database - e.g.
- RDBMS is a fairly classical case of OLTP
- most NoSQL (e.g. kv and document stores) are still largely on the OLTP side (if only because few do much processing themselves)
- e.g. ElasticSearch used for log analysis or monitoring (see e.g. logstash, kibana) is more on the OLAP side
- specific queries
- e.g. some complex queries may be OLAP
- queries may get so complex that people do some pre-structuring
It also has some applications to architecture. OLTP often implies tiered applications, whereas OLAP modelings is much more varied, often somewhat custom (though OLAP NoSQL does exist), from monolithic ETL batch script to complex interdependent streaming setups.
Lakes, warehouses, swamps, pits, and graveyards
Data warehouse means data in a processed, probably actively useful form.
This term has been around since the seventies.
Data lake means little more than "'not a data warehouse"
It often means "we're tracking things, we'll see how it's useful later"
This frequently also means storing it in raw from, again because you don't know what you want to extract from it. (...though if the thing is by nature already a structured document, this is more about whether it's indexed than whether it's parsed)
In some cases, data lakes involve so little thinking ahead that
- you don't know what you have
- you will have a really tough time figuring how to use or combine or even find this data later
- you created what you could creatively call
- data swamps (inaccessible)
- data pits (won't produce clear insights)
- data graveyards (you're not using it, and you never will)
- searching for 'Data lake' will mostly give you IoT related results -- mostly because fancy promises aside, IoT primarily just means 'collect all the data (personal or not) and see what we can do with it later'.
- Data lake sometimes also includes the software that accesses the data - anything from ETL/ELT, and sometimes much fancier analytics.
- more so of this is a company trying to sell you analytics software
- In practice, though,
- data warehouse and data lake are often used interchangeably.
- (so) when data warehousing can mean both, then ETL and ELT are both seen in data warehousing
Data warehousing historically usually implies ETL (Extract, Transform, Load)
ETL means clean up your data (data types, data validity, transform for the data model you have) before when entering it into the system
- ...often so that it can be stored in a structured database, meaning
- you have specific goals in mind
- so have already thought about efficient querying, and ensured faster loading
- and you need to define that database up front
Data lakes more often take the ELT (Extract, Load, Transform) approach,
...which is a way of saying "Don't do anything at all, just store what we got, in the form we got it in, and worry about transforming it later
This is in part pragmatism:
- we can capture any and all data
- there's no point in spending time on modeling or cleaning data we may never actually use - we can do that work if and when we do
- you need not restrict yourself to your original goal. You might be able to ask more varied questions, and get more answers, if you keep the original data around.
However, the same thing that leads to these upsides (you didn't do anything yet) lead to downsides:
- you will parse and validate it on use
- ...either on every use
- this tends to be true for ad-hoc querying, which tends to mean mapreduce / pipeline solutions
- (who cares if that takes more power and is slower)
- ...or on first use, e.g. if you validate-and-store into a working set for every question you work on
- meaning you store even more data, and take more time, (sort of because you're creating a warehouse for each use)
Some such databases, by type
Not a model but a practicality:
File databases are useful to persist and store moderately structured data on disk,
- usually within a single file (sometimes a handful)
- and usually from a single, simple library
The goal is often a mix of tendencies like
- persist a larger set of possibly-somewhat-complex data without having to
- do the low-level storage
- think about efficient lookup so much
- (demand of users to) install and configure a database engine
- think about such an database's networking, auth (because there is none)
- (because it need only) support only a single program/client
- relatively relatively read-heavy
key-value file databases
The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (string to string) in a file (sometimes two, splitting out some metadata/index(verify)).
These may be understood as an on-disk variant of a hashmap, allows rebucketing to have the hashing scale well, and fixed-size buckets to allow relatively efficient modification.
They are also libraries, so easy to embed and avoid network serving details.
Within the dbm family, the ones that are interesting enough to use are probably roughly:
- dbm, 'database manager' (1979, AT&T)
- ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
- sdbm, a 1987 clone of ndbm, made for licensing reasons
- gdbm, 'GNU database manager', also added arbitrary-length data
- Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.
There are also continuations / successors of this idea, including
- Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea
- tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but safely allows concurrent writers 
- tdbm: variant of ndbm with atomic transactions, in-memory databases
- not to be confused with the ORM called "the database machine" and also abbreviating to tdbm
- MDBM: memory-mapped key-value database store derived (like sdbm and ndbm) 
- QDBM 
Berkeley DB notes
Berkeley DB, also known as BDB and libdb, is basically a key-value map in a file.
It is a library instead of a server, so can be embedded, and is used like that quite a bit.
For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.
The low-level interface does not support concurrent write access from multiple processes.
It has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them.
From different processes, you would use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options, which also control whether the env is safe e.g. for multithreading but not between processes, safe between for several processes, etc. See things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.
- it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
- caching in shared memory
- option for mmapped read-only access (without the cache)
- option to keep database in memory rather than on disk
- concurrent access:
- writeahead logging or MVCC
- locking (fairly fine-grained)
- transactions (ACID), and recovery
- hot backups
- commits to multiple stores (XA interface), (since 2.5)
Both key and value are byte arrays; the application has to decide how it wishes to format and use data.
- Both key and value can be 232 bytes (4GB, though for keys that's usually not a great idea).
- A database file up to 248 bytes (256TB, which is more than various current filesystem limits).
It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.
There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference mostly in performance, and only when the data is large, since if all data fits in the cache, this is an near-non-issue.
For larger data sets you should consider how each type fits the way you access your data.
If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).
You can supply your own comparison and hash functions.
- Hash (DB_HASH)
- uses extended linear hashing; scales well and keeps minimal metadata
- supports insert and delete by key equality
- allows iteration, but in arbirtary order
- B+tree (DB_BTREE)
- ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
- allows lookup by range
- also keeps record numbers and allows access by them, but note that these change with changes in the tree, so are mostly useful for use by recno:
- recno (DB_RECNO)
- ordered records
- fast sequential access
- also with key-based random access - it is actually built on B+tree but generates keys internally
- queue (DB_QUEUE)
- fixed-size records
- fast sequential access
You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.
There are provisions to join databases on keys.
Versions and licenses
Early versions came from a library for simple hash, and later b-tree records.
- BDB 1.85 was a version for BSD 4.4, released in 1992 under a BSD-like license (that makes it behave like the GPL)
A Netscape request for further features led to the creation of Sleepycat Software in 1996
- BDB 2.1 was released in 1997, adding concurrent access, transactions
Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had little licensing consequences on versions since then (verify)
Versions 2 and later, including the Oracle additions, are dual-licensed, under either:
- The Sleepycat Commercial License: A purchased license that does not force redistribution
- The Sleepycat Public License: software that uses Berkeley DB must be distributed (note: excludes in-house development)
- apparently, the case of language support/wrapping is a special one (not unlike LGPL) as in cases like python and perl interfaces, the software that uses BDB is the language, not any scripts that use that interface/BDB. This does not seem to apply in the case of C library use(verify).
In other words, you generally have three options:
- use freely while the application is and stays entirely personal / internal to your business
- you have to distribute the source of the application that uses BDB
- you have to get a paid license
There are now three products:
- Berkeley DB (the basic C library)
- Berkeley DB Java, a Java version with options for object persistence
- Berkeley DB XML, an XML database using XQuery/XPath (see e.g. )
Some additional licenses apply to the latter.
Tokyo Cabinet / Kyoto Cabinet
database library, so meant for single-process use.
Tokyo Cabinet (2007) (written in C) is a embedded key-value database, a successor to QDBM
- on-disk B+ trees, hash tables, or fixed-length array
- some transaction
- no real concurrent use
- process-safe via exclusion control (via file locking), but only one writer can be connected at a time
- threadsafe (meaning what exactly?)
Kyoto Cabinet (2009) is intended to be a successor.
written in C++, the code is simpler than Tokyo, intends to work better around threads. (Single-thread seems a little slower than Tokyo)
- Tokyo may be a little faster
- Tokyo may be a little more stable (at leas in earlier days of Kyoto's development)
- Kyoto may be simpler to use
- Kyoto may be simpler to install
Lightning Memory-Mapped Database
- Ordered-map store
- ACID via MVCC
Relational file databases
SQLite is a lightweight library-based database engine - no server, no configuration, and a database it works on is a single file.
Also meaning you can use it without requiring an admin to install and configure a database engine.
It's a little fancier than similar library-only databases. Its feature set includes:
- support for most of SQL92
- ACIDity (using file-based locking)
- recovery via journaling
- concurrent access
- views (to a degree)
- querying multiple databases (as in different files)
How well it scales depends on some choices.
- general RDBMS scaling notes apply, e.g. well-chosen indices, using transactions and not the default autocommit
- some SQLite-specific notes, e.g. that
- writes that would go to multiple databases are more efficient in a transaction, some cache tweaks
- it doesn't deal as well with multiple clients
Say the writers: "SQLite is not designed to replace Oracle. It is designed to replace fopen()"
Cases where SQLite is interesting:
- embedded things - SQLite and BDB have both been fairly common here for a while
- application memory between runs, other file-based persistence
- The ACID thing is nice
- the indexes are nice
- structured file storage
- so that you don't have to think about binary coding of complex data
- interchange of nontrivial data between different programs / languages
- (that doesn't need to be minimum-latency)
- when accessed via a generic database interfaces, you can fairly easily switch between sqlite to real rdbms
- can be useful during development
- Creative uses, such as local caching of data from a loaded central database
- you can also have memory-only tables
- simple dynamic websites
- particularly if mostly read-only
Limitations, real or perceived:
- many-client cases: while sqlite will function with multiple users, it will perform better with fewer users
- ...though concurrency is much better since WAL, i.e. since version 3.7
- Its typing is non-strict. In programming language terms, it is dynamically and weakly, basically duck-typed.
- sometimes convenient, sometimes weird
- no row-level locking (but then, that's not commonly the largest scaling issue)
- no foreign keys (can be worked around with triggers)
- no permissions (...with the reason that it's fairly pointless if you can read the file anyway)
- no VIEW writing (verify)
- the less-common RIGHT JOIN and FULL OUTER JOIN are unimplemented (verify)
- triggers are somewhat limited (verify)
String or BLOB exceeded size limit seems to mean that one or both of:
- a query size exceeded SQLITE_MAX_SQL_LENGTH (1MB by default)
- you can avoid this by using parametric values (binding) instead of literal query strings
- a row, text field, or blob field exceeded SQLITE_MAX_LENGTH (1GB by default)
(possibly other limits)
If course, you can set these these while compiling. SQLITE_MAX_LENGTH may be 2^31-1, and SQLITE_MAX_SQL_LENGTH may be no larger than SQLITE_MAX_LENGTH (and possibly also no larger than 1GB?(verify) -- not that wanting a 2GB query is very sane...
Array oriented file databases
CDF (Common Data Form) is a scalable, primarily array-oriented format, which seems to have been designed to use and to archive large amounts of sensor data, with it being self-described, appendable, with one writer and multiple readers.
...so bears some similarities to HDF5 in intent, and is found in similar contexts.
In fact, NetCDF4 chose HDF5 as a storage layer under the covers, so libraries could easily deal with both CDF and HDF5 (though CDF data representation, especially classic, is more restricted in practical use).
In different contexts it may refers to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.
This is a little confusing, in that some of them have less constant over time. In particular, the data format for netCDF can be roughly split into
- classic (since 1989)
- classic with 64-bit offsets (since 2004)
- netCDF-4/HDF5 classic (since 2008)
- netCDF-4/HDF5 (since 2008)
The first three are largely interchangeable at API level, while the last, also called enhanced, allows more complex data representations that cannot be stored in classic.
Hierarchical Data Format (HDF)
Now typically meaning HDF5 (though HDF4 still sees some use).
Describes a data model, library, and file format.
- hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it
- Stores various array-like stuff, and halfway clever about readout of parts from huge datasets.
- structured, potentially complex data
- primarily cases where you have many items following the same structure, and often numerical data (but can store others)
- think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays
- fast lookup on predictable items
- offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.
- a little forethought helps, though
- portable data (a settled binary format, and self-described files)
- dealing with ongoing access to possibly-growing datasets
- parallel parallel IO has been considered
- multiple applications accessing a dataset, parallelizing IO accesses, allows is use on clustered filesystems
Unsorted (file databases)
Tokyo (and Kyoto)
Tokyo Tyrant / Kyoto Tycoon
database server, so useful when you want some concurrent use.
Supports expiry, so can act much like memcached
full-text search system
content management system built around cabinet.
Presentable in BBS, blog, and Wiki style
Storagey stuff - kv or document stores
- Weakly typed, document-oriented store
- values can be lists
- values can be embedded documents (maps)
- on its fields, dynamic
- e.g. ) - which together specifies a single query operation (e.g. it always sorts before limiting
- supportable with indices 
- field indexes - basic index
- compound indexes - indexes a combination, e.g. first looking for a userid, then something per-userid)
- multikey indexes - allows matching by one of the values for a field
- 2d geospatial - 'within radius', basically
- text search
- indexes can be:
- hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)
- partial index - only index documents matching a filter
- sparse index - only index documents having that have the field
- sharding, replication, and combination
- replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.
- attach binary blobs
- exact handling depends on your driver
- note: for storage of files that may be over 16MB, consider GridFS
- sort of like JSON, but binary, and has some extra things (like a date type)
- Not the fastest NoSQL variant in a bare-metal sense
- but often a good functionality/scalability tradeoff for queries that are a little more complex than just than key-value
- no transactions, but there are e.g. atomic update modifiers ("update this bunch of things at once")
- 2D geo indexing
- GridFS: chunking large files and actually having them backed by mongo.
- point being you can get a distributed filesystem
- One to one relationships? You probably want it in the same document. For most It saves extra lookups.
- E.g. a user's address
- One to many? Similar to one-to-many.
- E.g. a user's address when they have more than one.
- Always think about typical accesses and typical changes.
- For example, moving an entire family may go wrong because values have to change in many placers. (but then, it often might in RDBMS too because a foreign key would have to change in many places)
- foreign-key-like references can be in both places, because values can be lists and queries can search in them
- Usually, avoid setups where these lists will keep growing.
- when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.
- sometimes such denormalized information can actually be a good thing (for data-model sanity)
- e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.
- if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)
Also there's a document size limit
You can set _id yourself, but if you don't it'll get a unique, GUID-like identifier.
Key-value store with a focus on concurrency and fault tolerance,
Pluggable backends, e.g. allowing use as just a memcache, or give it persistence.
Eventually consistent (with some strong-consistency experiments?(verify))
Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)
- all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)
- file copy = hot backup (verify)
- keys stored on-disk
- secondary indexes, so limited limited relational-style querying at decent performance
- data compression
- no hot backup
- objects in ram
It aims to distribute perfectly (and supports some other features by assumptions), which implies you have to fix your cluster size ahead of time
Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache
Key-value store, distributed by using Raft consensus.
It is arguably a Kubernetes / Google thing now and has lessening general value, in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.
(not to be confused with couchbase)
Made to be compatible with memcachedb(verify), but with persistence.
- structured documents (schemaless)
- can attach binary blobs to documents
- RESTful HTTP/JSON API (to write, query)
- so you could do with little or no middle-end (you'll need some client-side rendering)
- shards its data
- eventually consistent
- ACIDity per doment operation (not larger, so inherently relational data)
- no foreign keys, no transactions
- running map-reduce on your data
- best fit for mapreduce tasks
- because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness
- which is nice for merging updated made remotely/offline (e.g. useful for mobile things)
- and don't use it as a message queue, or other things where you want these guarantees
- for acidity and conflict resolution, not in a store-forever way.
- An update will conflict if someone did an update based on the same version -- as it should.
document ~= row
- view group = process
- nice way to scale
- sharding is a bit harder
- not in views
- if large, consider CDNs, a simpler nosql key-val store, etc.
Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.
(previously known as Membase) (not to be confused with CouchDB)
CouchDB-like document store, plus a memcached-compatible interface
Differences to CouchDB include:
- optional immediate consistency for individual operations
- allows LDAP auth
- declarative query language
- stronger consistency design
Distributed key-value with optional transactional API
See also TiDB, which is basically an SQL layer on top that makes it NewSQL.
Mixed-model but seems built on (ordered) kv.
Distributed key-value store, a fork of Google's LevelDB
key-value or document store.
stronger consistency guarantees than many other things
supports transactions on more than a single object - making it more ACID-style than most any nosql. By default is looser, for speed(verify)
very interesting performance
optionally applies a schema
Most parts are BSD license. The warp add on, which provides fault-tolerant transactional part, is licensed. The evaluation variant you get omits the fault-tolerance guarantee part.
Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.
Storagey stuff - wide-column style
As wide column is arguably just a specialized use of a kv store
Distributed wide-column store
Uses CQL for its API (historically that was Thrift)
Seems to scale better than others, though apparently at cost of general latency
People report (cross-datacenter) replication works nicely.
Values availability, with consistency between being somewhat secondary NOTSURE (verify)
(if consistency is more important than availability, look at HBase instead)
- http://labs.google.com/papers/bigtable.html for the article on Bigtable
- stored on a per-column family basis
Storagey stuff - graph style
This one is mostly about the way you model your data, and the operations you can do, and do with fair efficienct.
..in that you can use e.g. key-value stores in a graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.
- key-value store
- with things like counting
- typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray (and some geo data types)
- that you get a basic set of queries/operations for
- allows transactions, which lets you do your own atomic updates when necessary
- pub/sub 
- (completely separate from the keyspace)
- can be sharded 
- in-client mapping (so must be consistent if you use it as a store rather than a cache!)
- primarily in-memory
- though allows persisting by
- logging requests for recovery that way (AOF), and/or
- snapshotting current state, e.g. every hour (RDB)
- ...but both are still more of a persistent cache than a primary store
- a memcache of sorts (session data, common resources)
- intermediate between services (see pub/sub)
- rate limiting between
- most/all types
- SET, MGET
- GET, MSET
- SETNX - set only if did not exist
- SETEX - set value and expiration
- EXPIRE, TTL, PERSIST
- integer / counter
- GET, MGET
- float value
- GETRANGE (substring)
- LPUSH, RPUSH
- LPOP, RPOP
- LRANGE (fetch by index range)
- LTRIM, RTRIM (throw away al)
- BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)
- RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues
- HSET, HGET
- HMSET, HMGET (multiple)
- MULTI, ended by either EXEC or DISCARD
- WATCH, UNWATCH
memcached is a memory-only key-value store
- evicts items
- by least use when memory limit is hit (LRU style, geared to keep most-used data)
- by explicit eviction time, if given
- client can shard to multiple backends - an approach that has some some footnotes
- no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency
It is not:
- storage. It's specifically not backed by disk
- a document store. Keys are limited to 250 characters and values to 1MB
- if you want larger, look at other key-value stores (and at distributed filesystems)
- You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)
- a cacheing database proxy
- you have to do the work of figuring out what to cache
- you have to do the work of figuring out dependencies
- you have to do the work of figuring invalidations
Note that it is easy to think of Lucene/Solr/ES as only a text search engine.
But it is implemented as a document store, with basic indices that act as selectors.
So it is entirely possibly to do more structured storage (and fairly schemaless when ES guesses the type of new fields), with some arbitrary selectors, which make it potentially quite useful for for analytics and monitoring.
And since it adds sharding, it scales pretty decently.
Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.
Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.
They are aimed at being efficient at range queries,
and often have functionality that helps
Is now part of a larger stack:
- InfluxDB - time series database
- Telegraf - agent used to ease collecting metrics. Some pluggable input / aggregation/ processing things
- Kapacitor - streams/batch processing on the server side
- Chronograf - dashboard.
- also some interface fto Kapacitor, e.g. for alerts
- Often compared to Grafana. Initially simpler than that, but more similar now
Flux refers to a query language used in some places.
InfluxDB can be distributed, and uses distributed consensus to stay synced.
Open-source, though some features (like distribution) are enterprise-only.
Data model notes
In comparison to a more relational view...
- database is a logical container for
- retention policies
- time series data
- continuous queries
- retention policy (RP) contain:
- replication factor (copies kept in the cluster) (autogen's default is 1)
- retention - how long to keep the data (min 1h) (autogen's default is infinite)
- shard group duration - how much data is stored in shards (min 1h) (autogen's default is 7d)
- measurements - each measurement is implicitly part of the retention policy you put it in
- each database can have one or more RPs
- you get a default called autogen (defaults mentioned above)
- you'll quickly notice them in addressing (testdb.autogen.measurementname) though can ignore everything about them at first
- measurement are like a table, containing tags, fields and points
- basically refers to a (measurement,tag) combination you'd likely use in querying -- see below
- tags are key-value pairs (column-like) you can add per datapoint that are
- part of a series' uniqueness, and is indexed
- basically, whatever you need for lookup
- limited to 64K (and you probably don't want to get close without good reason)
- field are key-value pairs (column-like) you can add per datapoint that are
- not part of its uniqueness, not indexed
- basically, any values you may be looking up
- types include float (32-bit), integer (64-bit), boolean, timestamp, or string
- a measurement takes the type of the first value it gets (and currently cannot be changed except with some creativity), so e.g. forcing integer (add ) over float is sometimes necessary, e.g. to store large values without losing precision
- keep in mind that a field will not take different types over time, even if it might be fine, so being consistent per measurement is a good idea. You'll see errors in the log like
- strings possibly not limited to 64K? (I've seen conflicting information)
- but you probably don't want to use influxdb as a blob store if you want it to stay efficient
- you can check current type with show field keys
- (data) points
- always have a time(verify)
- there is always an index on time
- time precision is another detail
Typical use of measurements, series, tags
Say you want to start keeping track of CPU use and are collecting it for various datacenters (various tutorials use an example like this).
You might have a
- specific database for this purpose (for admin reasons)
- retention policy mostly because you want monitoring stuff deleted after a year without thinking about it
- measurement called host_monitor
and want to enter a datapoint with
- tags like hostname=node4,datacenter=berlin,country=de
- fields like cpu0=88,cpu2=32
You'll notice this is a pile of everything CPU-related.
The tags are structured with common uses in mind, often the coarsest and finest things you anticipate querying on - you can e.g. pick out and e.g. average per country, or pick out a particular host if needed (hostnames only unique within datacenters - i.e. combined with other tags).
Series are basically the idea that each unique combination of (measurement,all_tags) represents a series.
Data you send in from different places will often imply unique series, through having unique tags, though to some degree they are more of a querying concept, and a storage one only insofar that the indexing helps that.(verify)
On point uniqueness
A point is unique (measurementname,tagset,timestamp), so if you write when a record with that tuple already exists, field values are merged/overwritten.
Depending on the timestamp precision you hand into the ingest url, this
- may never happen, if it's based on 'now'
- may never be an issue, if the precision is much higher than the interval you send in
- may be something you do intentionally
On timestamp precision
Timestamps are currently nanosecond resolution by default. This can be reduced to microsecond, millisecond or second.
Lower-precision timestamps lead to
- a little more on-disk compression
- overwrites data with the same timestamp (see previous point)
- Is that per database, series, datapoint at insertion time?
- Does it mix precision if you alter precision over time?
/query queries, management /write ingest, takes line protocol
/ping health and version
/debug/pprof Generate profiles for troubleshooting /debug/requests Track HTTP client requests to the /write and /query endpoints /debug/vars Collect internal InfluxDB statistics
The line protocol is a one-liner text presentation that looks like
measurement,tag_set field_set timestamp
- tag_set and key_set are comma-separated key=val pairs
- timestamp is nanosecond-precision Unix time
- (also optional; defaults to local timestamp, UTC, but be aware of
- clock drift (so you likely want NTP)
- timezones (so have a serious think about using either client time or server time))
- (also optional; defaults to local timestamp, UTC, but be aware of
Clients may ease conversion of structured data to line protocol.
On shards and shard groups
Under the hood, data is stored in shards, shards are grouped in shard groups, shard groups are part of retention policies.
This is under-the-hood stuff you don't really need to know, though it may be useful to consider in that shard groups are related to
- the granularity with which old data is removed because of retention policy (it's dropped in units of shard groups - so never immediately)
- efficiency of the typical query, e.g. when most queries deal with just the last one and the rest is rarely touched, essentially archived and rarely or never touched by IO
Logging is great for debugging, but is both verbose and (unless all its clients POST instead of GET) puts possibly-sensitive information in system logs.
In which case you probably want to set log-enabled = false in [http]
Note that if you put a HTTP server in front, logging same may apply there as well.
If apache, look at something like
SetEnvIf Request_URI "/submit" dontlog CustomLog /var/log/apache2/access.log combined env=!dontlog
Because it's designed to be clustered, it serves on all interfaces by default (and names should be resolvable).
On a single-node installation you could choose a localhost-only via bind-address in the [http] section, which you'd want as
bind-address = "localhost:8086" # default is ":8086"
(There used to be an admin web interface on port 8083, but this has been removed[]. You now probably want to use Chronograf)
For similar reasons, by default there is no authentication
- you may wish to firewall things at IP level
- if you want auth, you need to enable security and create users (see below)
- auth happens at HTTP request scope, e.g. for the API and CLI
- certain service endpoints are not authenticated
- can do HTTPS itself 
- you can get server certs and client certs - and use self-signed ones if you wish
- note that in a microservice style setup, you may wish to do this on the edge / ingest sides instead
User authenticationEnable: set in the section and restart
- use basic auth
- hand in sername and assword in the URL or body
- JSON Web Tokens
If nonlocal, it's recommended you use HTTPS, because all of these options are effectively plaintext.
New non-admin users have no rights. They can be given
- WRITE, or
- ALL (meaning READ and WRITE)
New admin users have a lot more granularity, like
- CREATE DATABASE, and DROP DATABASE
- DROP SERIES and DROP MEASUREMENT
- CREATE RETENTION POLICY, ALTER RETENTION POLICY, and DROP RETENTION POLICY
- CREATE CONTINUOUS QUERY and DROP CONTINUOUS QUERY
- user management
InfluxQL - an SQL-like language 
Flux - a more featured language 
InfluxQL examples (you may want to run the cli, e.g. :
where that argument is for human-readable time formatting)
A simple query would be
SELECT "eth0_rx", "eth0_tx" FROM "pc_monitor"
would be all data of that series that we have
SELECT "cpu_used" FROM "pc_monitor" WHERE time > now() - 15m
but when you want a timeseries you often want to regularize it like:
SELECT mean("cpu_used") FROM "pc_monitor" WHERE time > now() - 15m GROUP BY time(1m) fill(null)
This adds a time interval (1m), what to do with multiple values (aggregate into the mean)
On fill(): GROUP BY time() creates regular intervals(verify), so it has to do something for intervals with no data. Options:
- null: return timestamp with null value (default)
- none: omit entry for time range
- previous: copy value from previous time interval
- linear: linear interpolation
Getting the most recent value
SELECT last("cpu_used") FROM "pc_monitor" WHERE time > now() - 1h
- last() aggregate is what it sounds like
- you want a time limit, to avoid selecting the entire time series for that to
- you probably want that anyway, when you care to view something current
For a gauge, you may want a recent average, like:
SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) ORDER BY time desc
- because this compares against now, the most recent interval that GROUP creates may not have a value in it yet, meaning you'll get
SELECT LAST(eth0_tx) from pc_monitor
SELECT LAST(field_name), * from test_result GROUP BY *
GROUP BY * effectively separates by series
SELECT LAST(*) from pc_monitor group by *
Practically similar to
SELECT * FROM "pc_monitor" WHERE time > now() - 15s ORDER BY time desc limit 1
Though you may like some averaging, like
SELECT mean(cpu_used) FROM "pc_monitor" WHERE time > now() - 5s group by time(5s) fill(none) ORDER BY time desc limit 1
Note that without the ORDER BY time desc limit 1 you'ld probaly get two time periods (at least until the group time is at least twice the selection time)
Dealing with null
If this is about removing too-old data, the never-think-about-it approach is to set up retention policies.
...but yes, you can do things like:
DROP MEASUREMENT "net"
- all data and series from a measurement
DROP SERIES FROM "net" WHERE hostname='myhostname'
- drops all series that apply
DELETE FROM "net_monitor" WHERE hostname='myhostname' and time < now() - 1h
- the delete granularity is effectively measurements (not tags) (verify)
- this won't delete the series, even if it removes all points
DROP SHARD shardid
- you'd probably get the shard id from
Use the CLI, something like something chronograf or grafana.
There used to be an interface [ https://docs.influxdata.com/influxdb/cloud/query-data/execute-queries/data-explorer/ this?]
> SHOW DATABASES _internal foo > use foo Using database monitor > show series
Backup and restore
influxd backup -database name -portable backup.data
Because of the compression done to older data, and the often-quite-compressible nature of time series, most monitoring needs don't really need to worry about space use.
This of course does scale with the amount of counters, and the time resolution of insertion.
For example, in a 70 day test with dozens of counters inserting in intervals between 2 to 300 seconds, space sawtoothed (because the compression is staged) up 200MB, from 500ish to 700ish.
Not nothing, and on embedded you probably still want rrdtool, but also nothing to worry about on, say, raspberries or small VPSes.
Separate install and binary, so needs to be pointed at a InfluxDB instance
Tied to Postgres
Storagey stuff - NewSQL
NewSQL usually points at distributed variants of SQL-accessed databases.
So things that act like RDBMSes in the sense of
- giving of SQL features (not just a SQL-looking query language),
- providing transactions,
- and as ACID as possible while also proviiding some amount of scaling.
NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"
- distributed SQL
- PostgreSQL API
- Cassandra-like API
- MySQL interface
- tries to be OLTP and OLAP
- handles DDL better?
- in spired by Google Spanner
- see aksi TiKV - TiDB is basically the SQL layer on top of TiKV (verify)
Aurora (Amazon-hosted only)
- MySQL and PostgreSQL API to something a little more scalable than those by themselves
- basically BigTable's successor