Some databases, sorted by those types


File databases

Not a model but a practicality:

File databases are useful to persist moderately structured data on disk,

often storing many items in one file (or a few)
usually accessible via a single, simple library - meaning they are also considered embedded databases



There is typically also some way to alter that data item by item (rather than, as with serialization, rewriting the whole thing at once), though this may be limited, or just not very performant.


The goal is often a mix of

  • persist a moderate set of possibly-somewhat-complex data without having to
do the low-level storage yourself
think so much about efficient lookup
(demand that users) install and configure a database engine
think about such a database's networking and auth (because there is none)
  • support only a single program/client (because that is all it needs to)
  • serve relatively read-heavy use
  • avoid requiring an external process
that has to be running, and that someone has to start and be responsible for


key-value file databases

The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (string to string) in a file (sometimes two, splitting out some metadata/index(verify)).

These may be understood as an on-disk variant of a hashmap, with rebucketing so that the hashing scales well, and fixed-size buckets to allow relatively efficient modification.


They are also libraries, so easy to embed and avoid network serving details.
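As a quick illustration, Python's standard library still ships a dbm module in this family; a minimal sketch (the file name is just an example):

 import dbm

 # Open (creating if needed) a dbm-family file database; the actual backend
 # (gdbm, ndbm, or a fallback) depends on what is available on the system.
 with dbm.open("settings.db", "c") as db:
     db[b"greeting"] = b"hello"       # keys and values are byte strings
     print(db[b"greeting"])           # b'hello'
     print(b"missing" in db)          # False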


dbm family

Within the dbm family, the ones that are interesting enough to use are probably roughly:

  • dbm, 'database manager' (1979, AT&T)
  • ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
    • sdbm, a 1987 clone of ndbm, made for licensing reasons
  • gdbm, 'GNU database manager', also added arbitrary-length data
  • Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.


There are also continuations / successors of this idea, including

  • Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea
  • tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but safely allows concurrent writers [1]
  • tdbm: variant of ndbm with atomic transactions, in-memory databases
not to be confused with the ORM called "the database machine" and also abbreviating to tdbm
  • MDBM: memory-mapped key-value database store, derived from this same family (like sdbm and ndbm) [2]


Berkeley DB notes

Berkeley DB, also known as BDB and libdb, is basically a key-value map in a file.

It is a library instead of a server, so can be embedded, and is used like that quite a bit.

For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.


Technical notes

There is a low-level interface that does not support concurrent write access from multiple processes.


It also has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them, and then may want to specify whether it's to be safe between threads and/or processes, and other such details.

From different processes, you would probably use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options - see things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.


Interesting aspects/features:

  • it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
  • caching in shared memory
  • option for mmapped read-only access (without the cache)
  • option to keep database in memory rather than on disk
  • concurrent access:
    • write-ahead logging or MVCC
    • locking (fairly fine-grained)
    • transactions (ACID), and recovery
  • hot backups
  • Distribution:
    • replication
    • commits to multiple stores (XA interface), (since 2.5)


Both key and value are byte arrays; the application has to decide how it wishes to format and use data.

Both key and value can be up to 2^32 bytes (4GB, though for keys that's usually not a great idea).
A database file can be up to 2^48 bytes (256TB, which is more than various current filesystem limits).

It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.

Format/access types

There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference is mostly in performance, and only when the data is large, since if all data fits in the cache this is a near-non-issue.

For larger data sets you should consider how each type fits the way you access your data.

If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).


You can supply your own comparison and hash functions.

More details:

  • Hash (DB_HASH)
    • uses extended linear hashing; scales well and keeps minimal metadata
    • supports insert and delete by key equality
    • allows iteration, but in arbitrary order
  • B+tree (DB_BTREE)
    • ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
    • allows lookup by range
    • also keeps record numbers and allows access by them, but note that these change as the tree changes, so they are mostly useful for recno's internal use
  • recno (DB_RECNO)
    • ordered records
    • fast sequential access
    • also with key-based random access - it is actually built on B+tree but generates keys internally
  • queue (DB_QUEUE)
    • fixed-size records
    • fast sequential access


You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.

There are provisions to join databases on keys.
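For a feel of the API from Python, a rough sketch assuming the bsddb3 (or newer berkeleydb) bindings - treat names and flags as illustrative:

 from bsddb3 import db  # packaged as 'berkeleydb' in newer releases; same API

 d = db.DB()
 d.open("example.bdb", None, db.DB_BTREE, db.DB_CREATE)

 d.put(b"2024-01-01", b"first")   # keys and values are byte strings
 d.put(b"2024-01-02", b"second")
 print(d.get(b"2024-01-01"))      # b'first'

 # B-tree keeps keys ordered, so a cursor iterates in key order
 cur = d.cursor()
 rec = cur.first()
 while rec is not None:
     print(rec)
     rec = cur.next()
 cur.close()
 d.close()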


Versions and licenses
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
this is probably incorrect in parts
it's hard to find good details


Early versions came from a library for simple hash records, and later also b-tree records.

  • BDB 1.85 was a version for BSD 4.4, released in 1992 under a BSD-like license (that makes it behave like the GPL)


A Netscape request for further features led to the creation of Sleepycat Software in 1996

  • BDB 2.1 was released in 1997, adding concurrent access, transactions


Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had few licensing consequences on versions since then (verify)


Versions 2 and later, including the Oracle additions, are dual-licensed, under either:

  • The Sleepycat Commercial License: A purchased license that does not force redistribution
  • The Sleepycat Public License: the source of software that uses Berkeley DB must be distributed (note: this excludes in-house development)
    • apparently, language support/wrapping is a special case (not unlike the LGPL): for e.g. the Python and Perl interfaces, the software that uses BDB is the language binding, not any scripts that use that interface. This does not seem to apply in the case of C library use(verify).

In other words, you generally have three options:

  • use it freely, as long as the application is and stays entirely personal / internal to your business
  • distribute the source of the application that uses BDB
  • get a paid license


There are now three products:

  • Berkeley DB (the basic C library)
  • Berkeley DB Java, a Java version with options for object persistence
  • Berkeley DB XML, an XML database using XQuery/XPath (see e.g. [4])

Some additional licenses apply to the latter.


See also


Tokyo Cabinet / Kyoto Cabinet

database library, so meant for single-process use.


Tokyo Cabinet (2007, written in C) is an embedded key-value database, a successor to QDBM

  • on-disk B+ trees, hash tables, or fixed-length array
  • multithreaded
  • some transaction support
  • no real concurrent use
process-safe via exclusion control (file locking), but only one writer can be connected at a time
thread-safe (meaning what, exactly?)


Kyoto Cabinet (2009) is intended to be its successor.

Written in C++, its code is simpler than Tokyo's, and it intends to work better around threads. (Single-threaded it seems a little slower than Tokyo.)


Comparison:

  • Tokyo may be a little faster
  • Tokyo may be a little more stable (at least in the earlier days of Kyoto's development)
  • Kyoto may be simpler to use
  • Kyoto may be simpler to install


LightningDB / LMDB

Lightning Memory-Mapped Database, a.k.a. LightningDB (and MDB before a rename)

  • Ordered-map store
  • ACID via MVCC
  • concurrency

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database

http://symas.com/mdb/
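A minimal sketch of use from Python, assuming the lmdb package (path and sizes are just examples):

 import lmdb

 # map_size is the maximum size the memory map (and so the database) may grow to
 env = lmdb.open("/tmp/example-lmdb", map_size=1 << 30)

 with env.begin(write=True) as txn:       # transactions give the ACID part
     txn.put(b"user:1", b"alice")
     txn.put(b"user:2", b"bob")

 with env.begin() as txn:                 # read-only transaction
     print(txn.get(b"user:1"))            # b'alice'
     for key, value in txn.cursor():      # keys come back in sorted order
         print(key, value)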

cdb

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
  • cdb (Constant DataBase), basically a fast on-disk associative array [5] [6]


Also: some new things

e.g. LevelDB and RocksDB, on this page filed under kv store

Relational file databases

SQLite

SQLite is a library-based database engine.

Below is some introduction. For more specific/technical use notes, see SQLite notes



No server: no admin rights required to install it or start it, no network port to use, no port to (forget to) firewall.

A database is a single file (with some helper files that appear temporarily but you typically don't have to think about), and there is no configuration (beyond things you do per-file-database, some of which are stored inside, some of which you have to be consistent about).

It's not optimized to be multi-client, but functionally you can get that to some degree.


It's a little fancier than similar library-only databases, in that it e.g. supports

  • recovery via journaling
  • concurrent access (...to a degree)
  • most of SQL92
  • constraints
  • indices
  • ACIDity
  • views (to a degree)
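For a sense of the 'no setup' point, typical use from Python's standard sqlite3 module (file and table names are just examples):

 import sqlite3

 conn = sqlite3.connect("app.db")          # creates the file if it doesn't exist
 conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")

 with conn:                                # context manager wraps this in a transaction
     conn.execute("INSERT INTO notes (body) VALUES (?)", ("remember the milk",))

 for row in conn.execute("SELECT id, body FROM notes"):
     print(row)

 conn.close()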


Cases where SQLite is interesting:

  • embedded things - SQLite and BDB have both been fairly common here for a good while
  • giving an application memory between runs
without reinventing various wheels
  • storage for simple dynamic websites (without needing a database server, though on shared hosting one is usually just there anyway)
particularly if mostly read-only
  • interchange of nontrivial data between different programs / languages
(that doesn't need to be minimum-latency)
  • when accessed via a generic database interface, you can fairly easily switch between SQLite and a real RDBMS
e.g. during development it's useful to run tests against an SQLite backend rather than a fully installed database
  • Creative uses, such as local caching of data from a loaded central database
note you can also have memory-only tables


Limitations, real or perceived:

  • for concurrency and ACIDity from multiple clients, it requires file locking, which not all filesystems implement fully - particularly some network mounts
  • while sqlite will function with multiple users, assume it will perform better with fewer users (or only simple interactions)
  • no foreign keys before 3.6.19 (can be worked around with triggers) and they're still turned off by default
  • no VIEW writing (verify)
  • triggers are somewhat limited (verify)


Unlike larger database systems

  • the less-common RIGHT JOIN and FULL OUTER JOIN are unimplemented (verify) (but they're not used much, and you can rewrite queries)
  • no permission system (...it'd be fairly pointless if you can read the file anyway)
  • it is dynamically typed, meaning the stored type can vary per row, regardless of the declared column type
This is probably not your initial expectation (it is certainly unlike most RDBMSes), and you need to know what it does.
this is sometimes convenient, and sometimes a reason you need extra wrangling
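A small sketch of that dynamic typing, again with the standard sqlite3 module:

 import sqlite3

 conn = sqlite3.connect(":memory:")
 conn.execute("CREATE TABLE t (val INTEGER)")   # the declared type is only an 'affinity'
 conn.execute("INSERT INTO t VALUES (?)", (1,))
 conn.execute("INSERT INTO t VALUES (?)", ("not a number",))  # accepted anyway

 for (val,) in conn.execute("SELECT val FROM t"):
     print(repr(val))    # 1, then 'not a number' -- the type varies per row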

Like larger database engines

  • How well it performs depends on some choices
  • Autocommit (often the default) is good for concurrency, bad for performance (and sqlite libraries may add their own behaviour)
  • for larger things you want well-chosen indices
  • some cases of ALTER TABLE basically imply creating a new table


Array oriented file databases

CDF

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

CDF (Common Data Form) is a scalable, primarily array-oriented format, which seems to have been designed to work with and archive large amounts of sensor data: it is self-described, appendable, and supports one writer with multiple readers.


...so bears some similarities to HDF5 in intent, and is found in similar contexts.

In fact, NetCDF4 chose HDF5 as a storage layer under the covers, so libraries could easily deal with both CDF and HDF5 (though CDF data representation, especially classic, is more restricted in practical use).


In different contexts it may refer to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.

This is a little confusing, in that some of those have been pretty constant, and others have not. In particular, the data format for netCDF can be roughly split into

  • classic (since 1989)
  • classic with 64-bit offsets (since 2004)
  • netCDF-4/HDF5 classic (since 2008)
  • netCDF-4/HDF5 'enhanced' (since 2008)

The first three are largely interchangeable at API level, while the last allows more complex data representations that cannot be stored in classic.
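For a feel of the API, a minimal write/read sketch assuming the netCDF4 Python library (file and variable names are just examples):

 from netCDF4 import Dataset

 ds = Dataset("readings.nc", "w", format="NETCDF4")
 ds.createDimension("time", None)                      # unlimited: can keep appending
 temp = ds.createVariable("temperature", "f4", ("time",))
 temp.units = "degC"                                   # attributes make the file self-describing
 temp[0:3] = [20.1, 20.5, 21.0]
 ds.close()

 # readers only pull the slices they ask for
 ds = Dataset("readings.nc", "r")
 print(ds.variables["temperature"][:2])
 ds.close()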


See also:


And perhaps

Hierarchical Data Format (HDF)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

HDF describes a data model, library, and file format.

Now typically meaning HDF5 (though HDF4 still sees some use).


Goals/features:

  • hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it
  • Stores various array-like stuff, and halfway clever about readout of parts from huge datasets.
  • structured, potentially complex data
primarily cases where you have many items following the same structure, and often numerical data (but can store others)
think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays
  • fast lookup on predictable items
offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.
a little forethought helps, though
  • portable data (a settled binary format, and self-described files)
  • dealing with ongoing access to possibly-growing datasets
  • parallel IO has been considered
multiple applications accessing a dataset, parallelizing IO accesses, allowing its use on clustered filesystems
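As an illustration of the filesystem-like naming and partial reads, a small sketch assuming the h5py library (names are just examples):

 import h5py

 with h5py.File("sensors.h5", "w") as f:
     # filesystem-like naming: groups act like directories, datasets like files
     f.create_dataset("site1/temperature", data=[20.1, 20.5, 21.0], dtype="f4")
     f["site1"].attrs["location"] = "roof"     # self-describing metadata

 with h5py.File("sensors.h5", "r") as f:
     print(f["site1/temperature"][0:2])        # partial read: only this slice is fetched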


See also

Apache Arrow

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Arrow primarily is an abstraction on working with certain types of data.


Arrow is not a database, in the sense that you cannot assume that something that implements it lets you alter data efficiently.

Yet Arrow does propose to efficiently serialize data and pass it around, e.g.

keeping things relatively light in-memory
zero-copy where shared memory is possible (which is far from all cases)
preferring to stream in chunks if/where copies are necessary
fast to iterate
preferably near-O(1) for random access - though this does not hold (at all) for various of its serialisations
thinking about parallel IO


It prefers to work with column-style data and lends itself to table-like things (though technically it is more like hybrid columnar); there's no reason it won't store other types of data.


Arrow is not a single serialization/interchange/storage format.

...there are multiple, and they have distinct properties. And yes, this makes "arrow file" confusingly ambiguous.

  • The IPC/feather format (uncompressed, more about faster interchange)
    • in streaming style OR
    • in random access style
optionally memory mapped (and can be by multiple processes)
  • Parquet (compressed)
more about storage-efficient archiving
much less about efficient random access
(you can also load more classical formats like CSV and JSONL but lose most of the efficient lazy-loading features)


There is also pyarrow.dataset which

  • read or write partitioned sets of files
    • can understand certain (mostly predefined) ways you may have partitioned data into multiple files
do lazy loading, which can help RAM use (even on large Parquet, CSV, and JSONL)
  • can selectively load columns
  • and filter by values
  • Can load from S3 (and MinIO) and HDFS
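A minimal sketch with pyarrow and pyarrow.dataset (file and column names are just examples):

 import pyarrow as pa
 import pyarrow.parquet as pq
 import pyarrow.dataset as ds

 # Build a small columnar table and write it as Parquet
 table = pa.table({"userid": [1, 2, 2, 3], "amount": [9.5, 3.0, 1.25, 7.0]})
 pq.write_table(table, "payments.parquet")

 # Open it as a dataset: columns are loaded selectively, rows filtered lazily
 dataset = ds.dataset("payments.parquet", format="parquet")
 subset = dataset.to_table(columns=["amount"], filter=ds.field("userid") == 2)
 print(subset.to_pydict())   # {'amount': [3.0, 1.25]}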



https://en.wikipedia.org/wiki/Apache_Arrow


Unsorted (file databases)

Other/similar include:


Unsorted

Tokyo (and Kyoto)

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)
Tokyo Tyrant / Kyoto Tycoon

database server, so useful when you want some concurrent use.

Supports expiry, so can act much like memcached


Tokyo Dystopia

full-text search system

http://fallabs.com/tokyodystopia/


Tokyo Promenade

content management system built around Tokyo Cabinet.

Presentable in BBS, blog, and Wiki style


http://fallabs.com/tokyopromenade/


HamsterDB

https://github.com/GerHobbelt/hamsterdb

Which do I use?

Storagey stuff - kv or document stores

Ask for a key, get a blob of uninterpreted data, or map of somewhat structured data.

MongoDB notes

tl;dr

  • Weakly typed, document-oriented store
values can be lists
values can be embedded documents (maps)
  • searchable
on its fields, dynamic
e.g. db.collname.find({author:"mike"}).sort({date:-1}).limit(10) - which together specifies a single query operation (e.g. it always sorts before limiting [7])
supportable with indices [8]
field indexes - basic index
compound indexes - indexes a combination, e.g. first looking for a userid, then something per-userid
multikey indexes - allows matching by one of the values for a field
2d geospatial - 'within radius', basically
text search
indexes can be:
hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)
partial index - only index documents matching a filter
sparse index - only index documents that have the field


  • sharding, replication, and combination
replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.
  • attach binary blobs
exact handling depends on your driver[9]
note: for storage of files that may be over 16MB, consider GridFS


  • Protocol/format is binary (BSON[10]) (as is the actual storage(verify))
sort of like JSON, but binary, and has some extra things (like a date type)
  • Not the fastest NoSQL variant in a bare-metal sense
but often a good functionality/scalability tradeoff for queries that are a little more complex than just key-value
  • no transactions, but there are e.g. atomic update modifiers ("update this bunch of things at once")


  • 2D geo indexing
  • GridFS: chunking large files and actually having them backed by mongo.
point being you can get a distributed filesystem


  • mongo shell interface is javascript
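For illustration, roughly the same query from Python, assuming the pymongo driver (collection and field names are just examples):

 from pymongo import MongoClient, DESCENDING

 client = MongoClient("mongodb://localhost:27017")
 posts = client.blog.posts                      # database 'blog', collection 'posts'

 posts.insert_one({"author": "mike", "text": "hello", "date": "2024-01-01"})

 # compound index: first by author, then newest-first per author
 posts.create_index([("author", 1), ("date", DESCENDING)])

 for doc in posts.find({"author": "mike"}).sort("date", DESCENDING).limit(10):
     print(doc)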


Schema considerations

  • One to one relationships? You probably want it in the same document. For most cases it saves extra lookups.
E.g. a user's address
  • One to many? Often similar: embed them in the same document (values can be lists).
E.g. a user's addresses when they have more than one.
  • Always think about typical accesses and typical changes.
For example, moving an entire family may go wrong because values have to change in many places (but then, it often might in an RDBMS too, because a foreign key would have to change in many places)
  • foreign-key-like references can be in both places, because values can be lists and queries can search in them
Usually, avoid setups where these lists will keep growing.
  • when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.
  • sometimes such denormalized information can actually be a good thing (for data-model sanity)
e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.
  • if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)

Also there's a document size limit



You can set _id yourself, but if you don't it'll get a unique, GUID-like identifier.


See also:


GUI-ish browsers:

https://robomongo.org/


https://www.mongodb.com/products/compass


riak notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Key-value store with a focus on concurrency and fault tolerance.

Pluggable backends, e.g. allowing use as just a memcache, or giving it persistence.

Eventually consistent (with some strong-consistency experiments?(verify))

masterless

Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)


Backends include

  • bitcask
all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)
file copy = hot backup (verify)
  • leveldb
keys stored on-disk
secondary indexes, so limited relational-style querying at decent performance
data compression
no hot backup
  • innostore
  • memory
objects in ram

It aims to distribute perfectly (and supports some other features through that assumption), which implies you have to fix your cluster size ahead of time


Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache


etcd

Key-value store, distributed by using Raft consensus.


It is arguably a Kubernetes / Google thing now and has lessening general value, in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.

CouchDB notes

(not to be confused with couchbase)


Document store.

Made to be compatible with memcachedb(verify), but with persistence.


  • structured documents (schemaless)
can attach binary blobs to documents
  • RESTful HTTP/JSON API (to write, query)
so you could do with little or no middle-end (you'll need some client-side rendering)
  • shards its data
  • eventually consistent
  • ACIDity per document operation (not larger, so not suited to inherently relational data)
no foreign keys, no transactions
  • running map-reduce on your data
  • Views
best fit for mapreduce tasks
  • Replication
because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness
which is nice for merging updates made remotely/offline (e.g. useful for mobile things)
and don't use it as a message queue, or other things where you want these guarantees
  • revisions
for acidity and conflict resolution, not in a store-forever way.
An update will conflict if someone did an update based on the same version -- as it should.
  • Couchapps,


document ~= row
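Since the API is plain HTTP/JSON, a minimal sketch with Python's requests (database name, credentials, and port are just examples):

 import requests

 base = "http://admin:password@localhost:5984"

 requests.put(f"{base}/invoices")                       # create a database
 requests.put(f"{base}/invoices/inv-0001",              # create a document by id
              json={"customer": "acme", "total": 12.50})

 doc = requests.get(f"{base}/invoices/inv-0001").json()
 print(doc["total"], doc["_rev"])                       # _rev is the revision needed for updates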


Notes:

  • view group = process
nice way to scale
  • sharding is a bit harder


Attachments

  • not in views
  • if large, consider CDNs, a simpler nosql key-val store, etc.

See also:


PouchDB

Javascript analogue to CouchDB.

Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.

Couchbase notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

(previously known as Membase) (not to be confused with CouchDB)


CouchDB-like document store, plus a memcached-compatible interface


Differences to CouchDB include:

  • typing
  • optional immediate consistency for individual operations
  • allows LDAP auth
  • declarative query language
  • stronger consistency design


TiKV

Distributed key-value with optional transactional API


See also TiDB, which is basically an SQL layer on top that makes it NewSQL.


FoundationDB

Mixed-model but seems built on (ordered) kv.

Networked, can be clustered (replication, partitioning).

Does transactions, serializable isolation, so actually tries towards ACIDity.

Tries to avoid some issues via limitations like not allowing transactions to run longer than a few seconds or exceed 10MB of writes.


https://en.wikipedia.org/wiki/FoundationDB



LevelDB

On-disk key-value store

License: New BSD

https://en.wikipedia.org/wiki/LevelDB

https://github.com/google/leveldb
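A minimal sketch of use from Python, assuming the plyvel bindings (path is just an example):

 import plyvel

 db = plyvel.DB("/tmp/example-leveldb", create_if_missing=True)

 db.put(b"user:1", b"alice")
 db.put(b"user:2", b"bob")
 print(db.get(b"user:1"))                 # b'alice'

 # keys are stored sorted, so prefix/range iteration is cheap
 for key, value in db.iterator(prefix=b"user:"):
     print(key, value)

 db.close()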


RocksDB

A Facebook fork of LevelDB, focusing on some specific performance details

License: Apache2 / GPL2


https://rocksdb.org/

https://en.wikipedia.org/wiki/RocksDB

MonetDB

Column store

https://www.monetdb.org/Documentation/Manuals/MonetDB/Architecture

hyperdex

key-value or document store.

stronger consistency guarantees than many other things

supports transactions on more than a single object - making it more ACID-style than most any nosql. By default it is looser, for speed(verify)

very interesting performance

optionally applies a schema


Most parts are under a BSD license. The Warp add-on, which provides the fault-tolerant transactional part, is licensed separately. The evaluation variant you get omits the fault-tolerance guarantee part.

RethinkDB

Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.


https://www.rethinkdb.com/

Storagey stuff - wide-column style

Wide column is arguably just a specialized use of a kv store, but one that deserves its own name because it is specifically optimized for that use.


Cassandra

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Distributed wide-column store

Uses CQL for its API (historically that was Thrift)
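For a feel of CQL, a rough sketch assuming the DataStax cassandra-driver for Python (keyspace and table names are just examples):

 from cassandra.cluster import Cluster

 cluster = Cluster(["127.0.0.1"])
 session = cluster.connect()

 session.execute("""
     CREATE KEYSPACE IF NOT EXISTS demo
     WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
 """)
 session.execute("""
     CREATE TABLE IF NOT EXISTS demo.events (
         userid text, ts timestamp, payload text,
         PRIMARY KEY (userid, ts)      -- partition key plus clustering column
     )
 """)
 session.execute(
     "INSERT INTO demo.events (userid, ts, payload) VALUES (%s, toTimestamp(now()), %s)",
     ("user1", "clicked"))
 for row in session.execute("SELECT * FROM demo.events WHERE userid = %s", ("user1",)):
     print(row)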


Seems to scale better than others, though apparently at cost of general latency

masterless


People report (cross-datacenter) replication works nicely.


Values availability, with consistency being somewhat secondary (verify)

(if consistency is more important than availability, look at HBase instead)


See also:


http://basho.com/posts/technical/riak-vs-cassandra/


Hbase

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


An implementation imitating Google's Bigtable, part of Hadoop family (and built on top of HDFS).


See also:



  • stored on a per-column family basis


Hypertable

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

See also:


Accumulo

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

https://accumulo.apache.org/

Storagey stuff - graph style

This one is mostly about the way you model your data, and the operations you can do, and do with fair efficiency.

...in that you can use e.g. key-value stores in graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.


ArangoDB

https://www.arangodb.com/


Apache Giraph

http://giraph.apache.org/


Neo4j

OrientDB

FlockDB

AllegroGraph

http://franz.com/agraph/allegrograph/


GraphBase

GraphPack

Cachey stuff

redis notes

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

tl;dr

  • key-value store
with things like counting
  • typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray (and some geo data types)
that you get a basic set of queries/operations for
  • allows transactions, which let you do your own atomic updates when necessary
  • pub/sub channels (completely separate from the keyspace)
  • sharding/partitioning: shard mapping is in-client -- so it must be consistent if you use it as a store (rather than a cache)
https://www.tutorialspoint.com/redis/redis_partitioning.htm
  • primarily in-memory
though allows persisting by
logging requests for recovery that way (AOF), and/or
snapshotting current state, e.g. every hour (RDB)
...but both are still more of a persistent cache than a primary store


Uses

  • a memcache of sorts (session data, common resources)
  • intermediate between services (see pub/sub)
  • rate limiting between services


On commands:

  • most/all types
GET, SET
MGET, MSET
SETNX - set only if did not exist
SETEX - set value and expiration
MSETNX
MSETEX
GETSET
DEL
EXISTS
TYPE
EXPIRE, TTL, PERSIST
http://redis.io/commands#generic
http://redis.io/commands#keys
  • integer / counter
GET, MGET
INCR
DECR
INCRBY
DECRBY
  • float value
INCRBYFLOAT
  • bytestrings
STRLEN
APPEND
GETRANGE (substring)
SETRANGE
  • bitstrings
GETBIT
SETBIT
BITCOUNT
BITOP
BITPOS
  • list
LPUSH, RPUSH
LPOP, RPOP
LRANGE (fetch by index range)
LTRIM (keep only a given range, throw away the rest)
BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)
RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues
http://redis.io/commands#list
  • hash
HSET, HGET
HMSET, HMGET (multiple)
http://redis.io/commands#hash
  • transactions:
MULTI, ended by either EXEC or DISCARD
WATCH, UNWATCH
http://redis.io/commands#transactions
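A short sketch of a few of these from Python, assuming the redis-py client:

 import redis

 r = redis.Redis(host="localhost", port=6379)

 r.set("session:123", "some blob", ex=3600)   # SETEX-style: value plus expiry (seconds)
 r.incr("pageviews")                          # INCR on a counter key
 r.lpush("jobs", "job-1")                     # LPUSH onto a list
 print(r.rpop("jobs"))                        # b'job-1'

 # MULTI/EXEC transaction via a pipeline
 pipe = r.pipeline(transaction=True)
 pipe.incr("pageviews")
 pipe.expire("pageviews", 86400)
 pipe.execute()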

memcached

memcached is a memory-only key-value store

  • evicts items
by least use when memory limit is hit (LRU style, geared to keep most-used data)
by explicit eviction time, if given
  • client can shard to multiple backends - an approach that has some footnotes
  • no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency


It is not:

  • storage. It's specifically not backed by disk
  • a document store. Keys are limited to 250 characters and values to 1MB
if you want larger, look at other key-value stores (and at distributed filesystems)
  • redundant
You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)
  • a cacheing database proxy
you have to do the work of figuring out what to cache
you have to do the work of figuring out dependencies
you have to do the work of figuring out invalidations
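For illustration, basic use from Python, assuming the pymemcache client (key names and expiry are just examples):

 from pymemcache.client.base import Client

 mc = Client(("localhost", 11211))

 mc.set("user:1:name", "alice", expire=300)   # cached for 5 minutes
 print(mc.get("user:1:name"))                 # b'alice'
 print(mc.get("not-there"))                   # None -- you fall back to the real source
 mc.delete("user:1:name")                     # explicit invalidation is your job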


Originally developed for livejournal (by Danga Interactive) and released under a BSD-style license.


See also

Searchy stuff

ElasticSearch

See also Lucene and things that wrap it#ElasticSearch


Note that it is easy to think of Lucene/Solr/ES as only a text search engine.

But it is implemented as a document store, with basic indices that act as selectors.


So it is entirely possible to do more structured storage (and fairly schemaless, since ES guesses the type of new fields), with some arbitrary selectors, which makes it potentially quite useful for analytics and monitoring.


And since it adds sharding, it scales pretty decently.

Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.
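A small sketch assuming the official elasticsearch Python client (index name and query are just examples; exact keyword arguments differ a bit between client versions):

 from elasticsearch import Elasticsearch

 es = Elasticsearch("http://localhost:9200")

 # index a document; the mapping (field types) is guessed unless you define it
 es.index(index="metrics", document={"host": "web1", "cpu": 0.73, "ts": "2024-01-01T00:00:00"})

 # structured query: filter by a field, not just full-text search
 result = es.search(index="metrics", query={"term": {"host": "web1"}})
 for hit in result["hits"]["hits"]:
     print(hit["_source"])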

Time seriesy

Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.

They are aimed at being efficient at range queries, and often have functionality that helps with that.


InfluxDB

InfluxDB is primarily meant as a time series database, though can be abused for other things.

See also Influxdb notes


Is now part of a larger stack:

  • InfluxDB[13] - the time series database itself
  • Telegraf[14] - agent used to ease collecting metrics; some pluggable input / aggregation / processing things
  • Kapacitor[15] - streams/batch processing on the server side
  • Chronograf[16] - dashboard
also some interface to Kapacitor, e.g. for alerts
often compared to Grafana; initially simpler than that, but more similar now

Flux[17] refers to a query language used in some places.


InfluxDB can be distributed, and uses distributed consensus to stay synced.

Open-source, though some features (like distribution) are enterprise-only.

TimescaleDB

Tied to Postgres

OpenTSDB

Graphite

See Data_logging_and_graphing#Graphite_notes

Storagey stuff - NewSQL

NewSQL usually points at distributed variants of SQL-accessed databases.


So things that act like RDBMSes in the sense of

giving real SQL features (not just a SQL-looking query language),
providing transactions,
and being as ACID as possible while also providing some amount of scaling.


NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"


YugabyteDB

  • distributed SQL
PostgreSQL API
Cassandra-like API
https://en.wikipedia.org/wiki/YugabyteDB


TiDB

MySQL interface
tries to be both OLTP and OLAP
handles DDL better?
https://en.wikipedia.org/wiki/TiDB
inspired by Google Spanner
see also TiKV - TiDB is basically the SQL layer on top of TiKV (verify)


CockroachDB

https://en.wikipedia.org/wiki/CockroachDB


VoltDB

https://en.wikipedia.org/wiki/VoltDB



Unsorted (NewSQL)

Aurora (Amazon-hosted only)

MySQL and PostgreSQL API to something a little more scalable than those by themselves
https://en.wikipedia.org/wiki/Amazon_Aurora


Spanner

basically BigTable's successor
https://en.wikipedia.org/wiki/Spanner_(database)


H-Store

https://en.wikipedia.org/wiki/H-Store


See also: