Some databases, sorted by those types

File databases

Not a model but a practicality:

File databases are useful to persist and store moderately structured data on disk,

often storing many items in one file (or a few files)

usually accessible via a single, simple library - meaning they are also considered an embedded database

There is typically also some way to alter that data in terms of its items (rather than, as in serialization, the whole thing at once), which may be limited, or just not very performant.

The goal is often a mix of

persist a moderate set of possibly-somewhat-complex data without having to

do the low-level storage

think about efficient lookup so much

(demand of users to) install and configure a database engine

think about such a database's networking, auth (because there is none)

(because it need only) support only a single program/client

relatively read-heavy

avoids requiring an external process

that has to be running and someone has to start and be responsible for

key-value file databases

The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (string to string) in a file (sometimes two, splitting out some metadata/index(verify)).

These may be understood as an on-disk variant of a hashmap, which allows rebucketing so that the hashing scales well, and uses fixed-size buckets to allow relatively efficient modification.

They are also libraries, so easy to embed and avoid network serving details.
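
As a minimal sketch of what using such a library looks like, the following uses Python's standard `dbm` package. It assumes the `dbm.dumb` backend, which is Python's own portable fallback rather than one of the classic C implementations (`dbm.gnu` and `dbm.ndbm` wrap those where they are installed); the file name and keys are made up for illustration.

```python
import dbm.dumb
import os
import tempfile

def roundtrip(path):
    """Write two key-value pairs to a dbm-style file, reopen it, and read them back."""
    # 'c' opens for reading and writing, creating the file(s) if needed
    with dbm.dumb.open(path, "c") as db:
        db[b"user:1"] = b"alice"
        db[b"user:2"] = b"bob"
    # Reopen read-only to show the data was persisted on disk
    with dbm.dumb.open(path, "r") as db:
        return {key: db[key] for key in db.keys()}

with tempfile.TemporaryDirectory() as tmp:
    result = roundtrip(os.path.join(tmp, "example"))
```

Keys and values are plain bytes here, which mirrors the string-to-string mapping described above; anything more structured has to be serialized by the application.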

dbm family

Within the dbm family, the ones that are interesting enough to use are probably roughly:

dbm, 'database manager' (1979, AT&T)
ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
- sdbm, a 1987 clone of ndbm, made for licensing reasons

gdbm, 'GNU database manager', also added arbitrary-length data

Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.

There are also continuations / successors of this idea, including

Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea

tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but safely allows concurrent writers [1]

tdbm: variant of ndbm with atomic transactions, in-memory databases

not to be confused with the ORM called "the database machine" and also abbreviating to tdbm

MDBM: memory-mapped key-value database store derived (like sdbm and ndbm) [2]

QDBM [3]

Berkeley DB notes

Berkeley DB, also known as BDB and libdb, is basically a key-value map in a file.

It is a library instead of a server, so can be embedded, and is used like that quite a bit.

For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.

Technical notes

There is a low-level interface, that does not support concurrent write access from multiple processes.

It also has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them, and then may want to specify whether it's to be safe between threads and/or processes, and other such details.

From different processes, you would probably use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options - see things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.

Interesting aspects/features:

it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
caching in shared memory
option for mmapped read-only access (without the cache)
option to keep database in memory rather than on disk
concurrent access:
- write-ahead logging or MVCC
- locking (fairly fine-grained)
- transactions (ACID), and recovery
hot backups
Distribution:
- replication
- commits to multiple stores (XA interface), (since 2.5)

Both key and value are byte arrays; the application has to decide how it wishes to format and use data.

Both key and value can be up to 2³² bytes (4GB, though for keys that's usually not a great idea).

A database file can be up to 2⁴⁸ bytes (256TB, which is more than various current filesystem limits).

It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.

Format/access types

There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference is mostly in performance, and only when the data is large, since if all data fits in the cache, this is a near-non-issue.

For larger data sets you should consider how each type fits the way you access your data.

If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).

You can supply your own comparison and hash functions.

More details:

Hash (DB_HASH)
- uses extended linear hashing; scales well and keeps minimal metadata
- supports insert and delete by key equality
- allows iteration, but in arbitrary order

B+tree (DB_BTREE)
- ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
- allows lookup by range
- also keeps record numbers and allows access by them, but note that these change with changes in the tree, so are mostly useful for use by recno:

recno (DB_RECNO)
- ordered records
- fast sequential access
- also with key-based random access - it is actually built on B+tree but generates keys internally

queue (DB_QUEUE)
- fixed-size records
- fast sequential access
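
To make the difference between these access types a little more concrete, here is a plain-Python sketch. It does not use the BDB API at all: a `dict` stands in for hash access, and a sorted key list with `bisect` stands in for B+tree ordering. All names and data are illustrative assumptions.

```python
import bisect

records = {"apple": 1, "banana": 2, "cherry": 3, "date": 4}

# Hash-style access: equality lookup only, no useful key order to exploit
def hash_lookup(key):
    return records.get(key)

# B-tree-style access: keys are kept sorted, so a range is one contiguous slice
sorted_keys = sorted(records)

def range_lookup(low, high):
    """Return (key, value) pairs with low <= key < high, in key order."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_left(sorted_keys, high)
    return [(k, records[k]) for k in sorted_keys[start:end]]

# recno-style access: records addressed by 1-based position in the ordering
def recno_lookup(number):
    return records[sorted_keys[number - 1]]
```

The point is only that ordered storage makes range and positional access cheap, while a hash gives you equality lookups and an arbitrary iteration order.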

You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.

There are provisions to join databases on keys.

Versions and licenses

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

⚠ this is probably incorrect in parts
it's hard to find good details

Early versions came from a library for simple hash, and later b-tree records.

BDB 1.85 was a version for 4.4BSD, released in 1992 under a BSD license (the GPL-like redistribution terms came with the later Sleepycat license, see below)

A Netscape request for further features led to the creation of Sleepycat Software in 1996

BDB 2.1 was released in 1997, adding concurrent access, transactions

Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had few licensing consequences on versions since then (verify)

Versions 2 and later, including the Oracle additions, are dual-licensed, under either:

The Sleepycat Commercial License: A purchased license that does not force redistribution
The Sleepycat Public License: software that uses Berkeley DB must be distributed (note: excludes in-house development)
- apparently, the case of language support/wrapping is a special one (not unlike LGPL) as in cases like python and perl interfaces, the software that uses BDB is the language, not any scripts that use that interface/BDB. This does not seem to apply in the case of C library use(verify).

In other words, you generally have three options:

use freely while the application is and stays entirely personal / internal to your business
you have to distribute the source of the application that uses BDB
you have to get a paid license

There are now three products:

Berkeley DB (the basic C library)
Berkeley DB Java, a Java version with options for object persistence
Berkeley DB XML, an XML database using XQuery/XPath (see e.g. [4])

Some additional licenses apply to the latter.

Tokyo Cabinet / Kyoto Cabinet

database library, so meant for single-process use.

Tokyo Cabinet (2007) (written in C) is an embedded key-value database, a successor to QDBM

on-disk B+ trees, hash tables, or fixed-length array
multithreaded
some transaction support

no real concurrent use

process-safe via exclusion control (via file locking), but only one writer can be connected at a time

threadsafe (meaning what exactly?)

Kyoto Cabinet (2009) is intended to be a successor.

written in C++, the code is simpler than Tokyo, intends to work better around threads. (Single-thread seems a little slower than Tokyo)

Comparison:

Tokyo may be a little faster
Tokyo may be a little more stable (at least in earlier days of Kyoto's development)
Kyoto may be simpler to use
Kyoto may be simpler to install

LightningDB / LMDB

Lightning Memory-Mapped Database, a.k.a. LightningDB (and MDB before a rename)

Ordered-map store
ACID via MVCC
concurrency

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database

http://symas.com/mdb/

cdb

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

cdb (Constant DataBase), basically a fast on-disk associative array [5] [6]

Also: some new things

e.g. LevelDB and RocksDB, on this page filed under kv store

Relational file databases

SQLite

SQLite is a library-based database engine.

Below is some introduction. For more specific/technical use notes, see SQLite notes

No server, no admin rights required to install or start it or to use a network port or (not) firewall that port.

A database is a single file (with some helper files that appear temporarily but you typically don't have to think about), and there is no configuration (beyond things you do per-file-database, some of which are stored inside, some of which you have to be consistent about).

It's not optimized to be multi-client, but functionally you can get that to some degree.

It's a little fancier than similar library-only databases, in that it e.g. supports

recovery via journaling
concurrent access (...to a degree)
most of SQL92
constraints
indices
ACIDity
views (to a degree)

Cases where SQLite is interesting:

embedded things - SQLite and BDB have both been fairly common here for a good while
giving an application memory between runs

without reinventing various wheels

storage for simple dynamic websites (without needing a database, though on shared hosting this is usually just there anyway)

particularly if mostly read-only

interchange of nontrivial data between different programs / languages

(that doesn't need to be minimum-latency)

when accessed via a generic database interface, you can fairly easily switch between sqlite and a real RDBMS

e.g. during development it's useful to do tests with an sqlite backend rather than a full installed database

Creative uses, such as local caching of data from a loaded central database

note you can also have memory-only tables
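
A minimal sketch of the library-only nature, using Python's standard `sqlite3` module. The table and values are made up; `":memory:"` gives a database that lives only in RAM, and replacing it with a file path gives the persistent single-file variant described above.

```python
import sqlite3

# No server to start: connecting is just opening (or creating) the database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)")

# 'with conn' wraps the statements in one transaction and commits at the end
with conn:
    conn.execute("INSERT INTO settings VALUES (?, ?)", ("theme", "dark"))
    conn.execute("INSERT INTO settings VALUES (?, ?)", ("lang", "en"))

rows = conn.execute("SELECT name, value FROM settings ORDER BY name").fetchall()
```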

Limitations, real or perceived:

for concurrency and ACIDity from multiple clients, it requires file locking, which not all filesystems implement fully - particularly some network mounts
while sqlite will function with multiple users, assume it will perform better with fewer users (or only simple interactions)
no foreign keys before 3.6.19 (can be worked around with triggers) and they're still turned off by default
no VIEW writing (verify)
triggers are somewhat limited (verify)
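
The foreign key point is easy to trip over, so here is a small sketch of it. It assumes a typical SQLite build where enforcement is compiled in but off by default, so it has to be switched on per connection with `PRAGMA foreign_keys`; the table names are hypothetical.

```python
import sqlite3

def try_orphan_insert(enforce):
    """Insert a book that references a non-existent author; report what happened."""
    conn = sqlite3.connect(":memory:")
    try:
        if enforce:
            # Must be set per connection, and outside of a transaction
            conn.execute("PRAGMA foreign_keys = ON")
        conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY)")
        conn.execute(
            "CREATE TABLE book ("
            " id INTEGER PRIMARY KEY,"
            " author_id INTEGER REFERENCES author(id))"
        )
        try:
            conn.execute("INSERT INTO book (author_id) VALUES (42)")
            return "accepted"
        except sqlite3.IntegrityError:
            return "rejected"
    finally:
        conn.close()
```

With enforcement off the orphaned row is silently accepted, which is why the declaration alone gives no protection.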

Unlike larger database systems

the less-common RIGHT JOIN and FULL OUTER JOIN were unimplemented before 3.39.0 (but they're not used much, and you can rewrite queries)
no permission system (...it'd be fairly pointless if you can read the file anyway)
it is dynamically typed, meaning the type can vary per row, regardless of column type

This is probably not your initial expectation (certainly unlike most RDBMS), and you need to know what it does.

this is sometimes convenient, and sometimes a weird reason you need extra wrangling
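
A short sketch of what per-row typing means in practice, using the standard `sqlite3` module and a made-up one-column table. The declared column type only acts as a preference (SQLite calls it type affinity), so values that cannot be converted are stored as they are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")

# Five different kinds of value go into the same INTEGER column
for value in (1, "2", "two", 3.5, b"\x00"):
    conn.execute("INSERT INTO t VALUES (?)", (value,))

# typeof() reports the storage class each row actually ended up with
types = [row[0] for row in conn.execute("SELECT typeof(n) FROM t ORDER BY rowid")]
```

The text `"2"` is converted because it looks like an integer, while `"two"` stays text, so code reading the column should be prepared for mixed types.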

Like larger database engines

How well it performs depends on some choices
Autocommit (often the default) is good for concurrency, bad for performance (and sqlite libraries may add their own behaviour)
for larger things you want well-chosen indices
some cases of ALTER TABLE basically imply creating a new table
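
The autocommit trade-off can be sketched as follows, again with the standard `sqlite3` module and a hypothetical `log` table. Both variants store the same rows; the difference is how many commits are paid for, which mostly matters on disk-backed databases where each commit has to reach storage.

```python
import sqlite3

def insert_rows(conn, rows, one_transaction):
    """Insert rows in a single transaction, or commit after every row."""
    if one_transaction:
        # Commits once at the end (or rolls back if an exception is raised)
        with conn:
            conn.executemany("INSERT INTO log (msg) VALUES (?)", rows)
    else:
        # Autocommit-like behaviour: every row pays for its own commit
        for row in rows:
            conn.execute("INSERT INTO log (msg) VALUES (?)", row)
            conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, msg TEXT)")
rows = [(f"message {i}",) for i in range(1000)]
insert_rows(conn, rows, one_transaction=True)
insert_rows(conn, rows, one_transaction=False)
total = conn.execute("SELECT COUNT(*) FROM log").fetchone()[0]
```

Timing is deliberately left out here, since the gap depends heavily on the storage underneath; on an in-memory database it is small.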

Array oriented file databases

CDF

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

CDF (Common Data Form) is a scalable, primarily array-oriented format, which seems to have been designed to use and to archive large amounts of sensor data, with it being self-described, appendable, with one writer and multiple readers.

...so bears some similarities to HDF5 in intent, and is found in similar contexts.

In fact, NetCDF4 chose HDF5 as a storage layer under the covers, so libraries could easily deal with both CDF and HDF5 (though CDF data representation, especially classic, is more restricted in practical use).

In different contexts it may refer to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.

This is a little confusing, in that some of those have been pretty constant, and others have not. In particular, the data format for netCDF can be roughly split into

classic (since 1989)
classic with 64-bit offsets (since 2004)
netCDF-4/HDF5 classic (since 2008)
netCDF-4/HDF5 'enhanced' (since 2008)

The first three are largely interchangeable at API level, while the last allows more complex data representations that cannot be stored in classic.

See also:

https://www.unidata.ucar.edu/software/netcdf/

http://en.wikipedia.org/wiki/Netcdf

And perhaps

https://cdn.earthdata.nasa.gov/conduit/upload/497/ESDS-RFC-022v1.pdf

https://earthdata.nasa.gov/esdis/eso/standards-and-references/netcdf-4hdf5-file-format

Hierarchical Data Format (HDF)

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

HDF describes a data model, library, and file format.

Now typically meaning HDF5 (though HDF4 still sees some use).

Goals/features:

hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it

Stores various array-like stuff, and halfway clever about readout of parts from huge datasets.

structured, potentially complex data

primarily cases where you have many items following the same structure, and often numerical data (but can store others)

think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays

fast lookup on predictable items

offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.

a little forethought helps, though

portable data (a settled binary format, and self-described files)

dealing with ongoing access to possibly-growing datasets

parallel IO has been considered

multiple applications accessing a dataset, parallelizing IO accesses, allows its use on clustered filesystems

See also

Apache Arrow

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Arrow primarily is an abstraction on working with certain types of data.

Arrow is not a database, in the sense that you cannot assume that something that implements it lets you alter data efficiently.

Yet Arrow does propose to efficiently serialize data and pass data around, e.g.

thinking about keeping things relatively light in-memory

zero-copy where shared memory is possible (which is far from all cases),

prefers to stream in chunks if/where copies are necessary

fast to iterate and

preferably near-O(1) for random access - though this does not hold (at all) for various of its serialisations

think about parallel IO

It prefers to work with column-style data, and lends itself to table-like things, though technically it is more like hybrid columnar, and there's no reason it won't store other types of data.

Arrow is not a single serialization/interchange/storage format.

...there are multiple, and they have distinct properties. And yes, this makes "arrow file" confusingly ambiguous.

The IPC/feather format (uncompressed, more about faster interchange)
- in streaming style OR
- in random access style

optionally memory mapped (and can be by multiple processes)

Parquet (compressed)

more about storage-efficient archiving

much less about efficient random access

(you can also load more classical formats like CSV and JSONL but lose most of the efficient lazy-loading features)

There is also pyarrow.dataset, which can

read or write partitioned sets of files
- can understand certain (mostly predefined) ways you may have partitioned data into multiple files
- do lazy loading, which can help RAM use (even on large Parquet, CSV, and JSONL)
can selectively load columns
and filter by values
Can load from S3 (and MinIO) and HDFS

https://en.wikipedia.org/wiki/Apache_Arrow

Unsorted (file databases)

Other/similar include:

Unsorted

http://en.wikipedia.org/wiki/Scientific_data_format#Scientific_data_formats_.28data_exchange.29

Tokyo (and Kyoto)

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Tokyo Tyrant / Kyoto Tycoon

database server, so useful when you want some concurrent use.

Supports expiry, so can act much like memcached

Tokyo Dystopia

full-text search system

http://fallabs.com/tokyodystopia/

Tokyo Promenade

content management system built around cabinet.

Presentable in BBS, blog, and Wiki style

http://fallabs.com/tokyopromenade/

HamsterDB

https://github.com/GerHobbelt/hamsterdb

Which do I use?

Storagey stuff - kv or document stores

Ask for a key, get a blob of uninterpreted data, or map of somewhat structured data.

MongoDB notes

tl;dr

Weakly typed, document-oriented store

values can be lists

values can be embedded documents (maps)

searchable

on its fields, dynamic

e.g. db.collname.find({author:"mike"}).sort({date:-1}).limit(10) - which together specifies a single query operation (e.g. it always sorts before limiting [7])

supportable with indices [8]

field indexes - basic index

compound indexes - indexes a combination (e.g. first looking for a userid, then something per-userid)

multikey indexes - allows matching by one of the values for a field

2d geospatial - 'within radius', basically

text search

indexes can be:

hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)

partial index - only index documents matching a filter

sparse index - only index documents that have the field

sharding, replication, and combination

replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.

attach binary blobs

exact handling depends on your driver[9]

note: for storage of files that may be over 16MB, consider GridFS

Protocol/format is binary (BSON[10]) (as is the actual storage(verify))

sort of like JSON, but binary, and has some extra things (like a date type)

Not the fastest NoSQL variant in a bare-metal sense

but often a good functionality/scalability tradeoff for queries that are a little more complex than just key-value

no multi-document transactions before 4.0, but there are e.g. atomic update modifiers ("update this bunch of things at once")

2D geo indexing

GridFS: chunking large files and actually having them backed by mongo.

point being you can get a distributed filesystem

mongo shell interface is javascript
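
The find/sort/limit example in the list above describes one query, applied in a fixed order. As a sketch of those semantics only, here is a plain-Python emulation over a list of dicts. It is not the MongoDB driver API; `find_sort_limit` and the sample documents are invented for illustration.

```python
from datetime import date

posts = [
    {"author": "mike", "title": "a", "date": date(2020, 1, 1)},
    {"author": "anna", "title": "b", "date": date(2021, 1, 1)},
    {"author": "mike", "title": "c", "date": date(2022, 1, 1)},
    {"author": "mike", "title": "d", "date": date(2021, 6, 1)},
]

def find_sort_limit(docs, query, sort_key, descending, limit):
    """Filter by field equality, then sort, and only then cut to 'limit' items."""
    matched = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    matched.sort(key=lambda d: d[sort_key], reverse=descending)
    return matched[:limit]

latest = find_sort_limit(posts, {"author": "mike"}, "date", True, 2)
latest_titles = [d["title"] for d in latest]
```

Because the limit is applied last, you get the two newest posts by that author rather than two arbitrary matches that happen to be sorted.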

Schema considerations

One to one relationships? You probably want it in the same document. For most uses it saves extra lookups.

E.g. a user's address

One to many? Similar to one-to-one.

E.g. a user's address when they have more than one.

Always think about typical accesses and typical changes.

For example, moving an entire family may go wrong because values have to change in many places. (but then, it often might in RDBMS too because a foreign key would have to change in many places)

foreign-key-like references can be in both places, because values can be lists and queries can search in them

Usually, avoid setups where these lists will keep growing.

when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.

sometimes such denormalized information can actually be a good thing (for data-model sanity)

e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.

if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)

Also there's a document size limit (16MB)

You can set _id yourself, but if you don't it'll get a unique, GUID-like identifier.
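
The version-attribute suggestion above could look roughly like this in application code. This is a sketch with invented field names and upgrade steps, working on plain dicts rather than through any driver: documents without a `version` field are assumed to be version 1.

```python
import copy

CURRENT_VERSION = 3

def _v1_to_v2(doc):
    # v1 stored a single 'address' string; v2 stores a list of addresses
    doc["addresses"] = [doc.pop("address")] if "address" in doc else []
    return doc

def _v2_to_v3(doc):
    # v3 adds an explicit flag; older documents get a default value
    doc.setdefault("active", True)
    return doc

# Maps a version number to the step that upgrades it to the next version
UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}

def upgrade(doc):
    """Return a copy of doc brought up to CURRENT_VERSION, one step at a time."""
    doc = copy.deepcopy(doc)
    version = doc.get("version", 1)
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version += 1
    doc["version"] = version
    return doc
```

Upgrading on read like this means old documents can stay in the store untouched until they are next written.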

See also:

http://en.wikipedia.org/wiki/MongoDB

https://www.youtube.com/watch?v=w5qr4sx5Vt0

GUI-ish browsers:

https://robomongo.org/

https://www.mongodb.com/products/compass

riak notes

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Key-value store with a focus on concurrency and fault tolerance.

Pluggable backends, e.g. allowing use as just a memcache, or give it persistence.

Eventually consistent (with some strong-consistency experiments?(verify))

masterless

Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)

Backends include

bitcask

all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)

file copy = hot backup (verify)

leveldb

keys stored on-disk

secondary indexes, so limited relational-style querying at decent performance

data compression

no hot backup

innostore

memory

objects in ram
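
The bitcask idea mentioned above (all keys in a hashtable in RAM, values in a file) can be sketched in a few lines. This toy is not Riak's implementation: the real format also writes keys, sizes, and checksums to the file so the in-memory table can be rebuilt after a restart, and it merges old files. The class and file names are hypothetical.

```python
import os
import tempfile

class TinyLogStore:
    """Append-only value log plus an in-memory key directory."""

    def __init__(self, path):
        self.path = path
        self.keydir = {}  # key -> (offset, length) of the latest value
        open(path, "ab").close()  # make sure the log file exists

    def put(self, key, value):
        # Always append; an overwrite just points the key at the newer bytes
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)
            offset = log.tell()
            log.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        # One dictionary lookup, then one seek and one read
        offset, length = self.keydir[key]
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length)

with tempfile.TemporaryDirectory() as tmp:
    store = TinyLogStore(os.path.join(tmp, "data.log"))
    store.put("a", b"first")
    store.put("b", b"other")
    store.put("a", b"second")  # the old value stays in the file
    latest_a = store.get("a")
    log_size = os.path.getsize(store.path)
```

This shows both sides of the trade-off: lookups are fast because every key is in RAM, and for the same reason the number of keys is bounded by available memory.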

It aims to distribute perfectly (and supports some other features by assumptions), which implies you have to fix your cluster size ahead of time

Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache

etcd

Key-value store, distributed by using Raft consensus.

It is arguably a Kubernetes / Google thing now and has lessening general value, in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.

CouchDB notes

(not to be confused with couchbase)

Document store.

Not itself memcached-compatible(verify); that compatibility belongs to Couchbase (see Couchbase notes), while CouchDB is a persistent store with its own HTTP API.

structured documents (schemaless)

can attach binary blobs to documents

RESTful HTTP/JSON API (to write, query)

so you could do with little or no middle-end (you'll need some client-side rendering)

shards its data

eventually consistent

ACIDity per document operation (not larger, so a poor fit for inherently relational data)

no foreign keys, no transactions

running map-reduce on your data

Views

best fit for mapreduce tasks

Replication

because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness

which is nice for merging updates made remotely/offline (e.g. useful for mobile things)

and don't use it as a message queue, or other things where you want these guarantees

revisions

for acidity and conflict resolution, not in a store-forever way.

An update will conflict if someone did an update based on the same version -- as it should.
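
The revision rule above is a form of optimistic concurrency, and can be sketched without a server. This is a plain-Python illustration, not the CouchDB API: real revisions are strings of the form number-hash and go over HTTP, while here a bare counter is assumed and `RevisionedStore` is a made-up name.

```python
class ConflictError(Exception):
    """Raised when an update is based on a revision that is no longer current."""

class RevisionedStore:
    def __init__(self):
        self._docs = {}  # doc id -> (revision, body)

    def get(self, doc_id):
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store body if rev matches the current revision; return the new revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

Two clients that both read revision 1 can each try to write, but only the first write succeeds; the second has to fetch the new revision, merge, and try again.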

Couchapps,

document ~= row

Notes:

view group = process

nice way to scale

sharding is a bit harder

Attachments

not in views
if large, consider CDNs, a simpler nosql key-val store, etc.

See also:

https://www.youtube.com/watch?v=BKQ9kXKoHS8

PouchDB

Javascript analogue to CouchDB.

Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.

Couchbase notes

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(previously known as Membase) (not to be confused with CouchDB)

CouchDB-like document store, plus a memcached-compatible interface

Differences to CouchDB include:

typing
optional immediate consistency for individual operations
allows LDAP auth
declarative query language

stronger consistency design

TiKV

Distributed key-value with optional transactional API

See also TiDB, which is basically an SQL layer on top that makes it NewSQL.

FoundationDB

Mixed-model but seems built on (ordered) kv.

Networked, can be clustered (replication, partitioning).

Does transactions, serializable isolation, so actually tries towards ACIDity.

Tries to avoid some issues by not allowing transactions to run more than a few seconds, exceed 10MB of writes, and other limitations like that.

https://en.wikipedia.org/wiki/FoundationDB

LevelDB

On-disk key-value store

License: New BSD

https://en.wikipedia.org/wiki/LevelDB

https://github.com/google/leveldb

RocksDB

A Facebook fork of LevelDB, focusing on some specific performance details

License: Apache2 / GPL2

https://rocksdb.org/

https://en.wikipedia.org/wiki/RocksDB

MonetDB

Column store

https://www.monetdb.org/Documentation/Manuals/MonetDB/Architecture

hyperdex

key-value or document store.

stronger consistency guarantees than various other things

supports transactions on more than a single object - making it more ACID-style than most any nosql. By default is looser, for speed(verify)

very interesting performance

optionally applies a schema

Most parts are BSD license. The warp add on, which provides fault-tolerant transactional part, is licensed. The evaluation variant you get omits the fault-tolerance guarantee part.

RethinkDB

Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.

https://www.rethinkdb.com/

Storagey stuff - wide-column style

Wide column is arguably just a specialized use of a kv store, but one that deserves its own name because it is specifically optimized for that use.

Cassandra

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Distributed wide-column store

Uses CQL for its API (historically that was Thrift)

Seems to scale better than others, though apparently at cost of general latency

masterless

People report (cross-datacenter) replication works nicely.

Values availability, with consistency being somewhat secondary (verify)

(if consistency is more important than availability, look at HBase instead)

Hbase

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

An implementation imitating Google's Bigtable, part of Hadoop family (and built on top of HDFS).

See also:

https://en.wikipedia.org/wiki/Apache_HBase

http://labs.google.com/papers/bigtable.html for the article on Bigtable

stored on a per-column family basis

Hypertable

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

See also:

http://www.hypertable.org/

Accumulo

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

https://accumulo.apache.org/

Storagey stuff - graph style

This one is mostly about the way you model your data, and the operations you can do, and do with fair efficiency.

..in that you can use e.g. key-value stores in graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.

ArangoDB

https://www.arangodb.com/

Apache Giraph

http://giraph.apache.org/

Neo4j

OrientDB

FlockDB

AllegroGraph

http://franz.com/agraph/allegrograph/

GraphBase

GraphPack

Cachey stuff

redis notes

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

tl;dr

key-value store

with things like counting

typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray (and some geo data types)

that you get a basic set of queries/operations for

allows transactions, which lets you do your own atomic updates when necessary

pub/sub [11]

(completely separate from the keyspace)

can be sharded [12]

shard mapping is in-client -- so must be consistent if you use it as a store (rather than a cache)

https://www.tutorialspoint.com/redis/redis_partitioning.htm

primarily in-memory

though allows persisting by

logging requests for recovery that way (AOF), and/or

snapshotting current state, e.g. every hour (RDB)

...but both are still more of a persistent cache than a primary store
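
The note above about in-client shard mapping having to be consistent can be illustrated with a naive hash-modulo mapping. This is a sketch of the general idea, not what any particular Redis client does; `shard_for` and the key names are assumptions.

```python
import hashlib

def shard_for(key, n_shards):
    """Pick a shard index from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

keys = [f"user:{i}" for i in range(1000)]

# Every client computing this gets the same answer, as long as n_shards agrees
placement = {key: shard_for(key, 4) for key in keys}

# Going from 4 to 5 shards sends most keys to a different shard
moved = sum(shard_for(key, 4) != shard_for(key, 5) for key in keys)
```

For a cache, keys landing elsewhere just means misses for a while. For a store it means data appears to be gone, which is why the mapping (or a scheme such as consistent hashing that moves fewer keys) has to be agreed on by all clients.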

Uses

a memcache of sorts (session data, common resources)

intermediate between services (see pub/sub)

rate limiting between

On commands:

most/all types

GET, MGET

SET, MSET

SETNX - set only if did not exist

SETEX - set value and expiration

MSETNX

MSETEX

GETSET

DEL

EXISTS

TYPE

EXPIRE, TTL, PERSIST

http://redis.io/commands#generic

http://redis.io/commands#keys

integer / counter

GET, MGET

INCR

DECR

INCRBY

DECRBY

float value

INCRBYFLOAT

bytestrings

STRLEN

APPEND

GETRANGE (substring)

SETRANGE

bitstrings

GETBIT

SETBIT

BITCOUNT

BITOP

BITPOS

list

LPUSH, RPUSH

LPOP, RPOP

LRANGE (fetch by index range)

LTRIM (throw away everything outside the given index range)

BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)

RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues

http://redis.io/commands#list

hash

HSET, HGET

HMSET, HMGET (multiple)

http://redis.io/commands#hash

transactions:

MULTI, ended by either EXEC or DISCARD

WATCH, UNWATCH

http://redis.io/commands#transactions

memcached

memcached is a memory-only key-value store

evicts items

by least use when memory limit is hit (LRU style, geared to keep most-used data)

by explicit eviction time, if given

client can shard to multiple backends - an approach that has some footnotes

no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency
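
The two eviction rules listed above can be sketched together in plain Python. This is an illustration of the behaviour, not of memcached's implementation: it bounds the cache by item count rather than by memory, and `TinyCache` and its injectable `clock` parameter are assumptions made to keep the example testable.

```python
import time
from collections import OrderedDict

class TinyCache:
    """In-memory cache with least-recently-used eviction and optional expiry."""

    def __init__(self, max_items, clock=time.monotonic):
        self.max_items = max_items
        self.clock = clock
        self._items = OrderedDict()  # key -> (value, expiry time or None)

    def set(self, key, value, ttl=None):
        expires = self.clock() + ttl if ttl is not None else None
        self._items[key] = (value, expires)
        self._items.move_to_end(key)  # newest counts as most recently used
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # drop the least recently used

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self.clock() >= expires:
            del self._items[key]  # expired: behave as if it was never there
            return None
        self._items.move_to_end(key)  # a read also counts as use
        return value
```

A miss and an expired or evicted item look identical to the caller, which is the reason anything stored this way has to be reconstructible from somewhere else.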

It is not:

storage. It's specifically not backed by disk

a document store. Keys are limited to 250 characters and values to 1MB

if you want larger, look at other key-value stores (and at distributed filesystems)

redundant

You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)

a caching database proxy

you have to do the work of figuring out what to cache

you have to do the work of figuring out dependencies

you have to do the work of figuring out invalidations

Originally developed for livejournal (by Danga Interactive) and released under a BSD-style license.

Searchy stuff

ElasticSearch

Note that it is easy to think of Lucene/Solr/ES as only a text search engine.

But it is implemented as a document store, with basic indices that act as selectors.

So it is entirely possible to do more structured storage (and fairly schemaless when ES guesses the type of new fields), with some arbitrary selectors, which makes it potentially quite useful for analytics and monitoring.

And since it adds sharding, it scales pretty decently.

Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.

Time seriesy

Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.

They are aimed at being efficient at range queries,

and often have functionality that helps
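
As a sketch of why time-ordered storage makes range queries cheap, here is a plain-Python version using `bisect` over sorted timestamps, plus a simple bucketed average of the kind often used for graphs. The sample data and function names are made up, and real time series databases add much more (compression, retention, indexing by tags).

```python
import bisect
from statistics import mean

# Samples kept in timestamp order: (seconds, value)
samples = [(0, 1.0), (10, 2.0), (20, 4.0), (60, 3.0), (70, 5.0), (130, 9.0)]
timestamps = [t for t, _ in samples]

def range_query(start, end):
    """Return samples with start <= timestamp < end via two binary searches."""
    low = bisect.bisect_left(timestamps, start)
    high = bisect.bisect_left(timestamps, end)
    return samples[low:high]

def downsample(start, end, bucket_seconds):
    """Average the values per fixed-size time bucket within the range."""
    buckets = {}
    for t, value in range_query(start, end):
        bucket_start = t - (t % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return {b: mean(values) for b, values in sorted(buckets.items())}
```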

InfluxDB

InfluxDB is primarily meant as a time series database, though can be abused for other things.

TimescaleDB

Tied to Postgres

OpenTSDB

Graphite

See Data_logging_and_graphing#Graphite_notes

Storagey stuff - NewSQL

NewSQL usually points at distributed variants of SQL-accessed databases.

So things that act like RDBMSes in the sense of

giving you SQL features (not just a SQL-looking query language),

providing transactions,

and as ACID as possible while also providing some amount of scaling.

NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"

YugabyteDB

distributed SQL

PostgreSQL API

Cassandra-like API

https://en.wikipedia.org/wiki/YugabyteDB

TiDB

MySQL interface

tries to be OLTP and OLAP

handles DDL better?

https://en.wikipedia.org/wiki/TiDB

inspired by Google Spanner

see also TiKV - TiDB is basically the SQL layer on top of TiKV (verify)

CockroachDB

https://en.wikipedia.org/wiki/CockroachDB

VoltDB

https://en.wikipedia.org/wiki/VoltDB

Unsorted (NewSQL)

Aurora (Amazon-hosted only)

MySQL and PostgreSQL API to something a little more scalable than those by themselves

https://en.wikipedia.org/wiki/Amazon_Aurora

Spanner

basically BigTable's successor

https://en.wikipedia.org/wiki/Spanner_(database)

H-Store

https://en.wikipedia.org/wiki/H-Store

See also:

https://en.wikipedia.org/wiki/NewSQL
