Some databases, sorted by broad types
File databases
Not a model but a practicality:
File databases are useful to persist and store moderately structured data on disk,
- often storing many items in one (or a few files)
- usually accessible via a single, simple library - meaning they are also considered an embedded database
There is typically also some way to alter that data in terms of its items (rather than, as in serialization, the whole thing at once),
which may be limited, or just not very performant.
The goal is often a mix of
- persist a moderate set of possibly-somewhat-complex data without having to
- do the low-level storage
- think about efficient lookup so much
- (demand of users to) install and configure a database engine
- think about such a database's networking or auth (because there is none)
- (because it need only) support only a single program/client
- relatively read-heavy
- avoids requiring an external process
- that has to be running and someone has to start and be responsible for
key-value file databases
The simpler variants may be based somewhat on the older, simple, yet effective dbm, a (family of) database engine(s) that stores key-value mappings (string to string) in a file (sometimes two, splitting out some metadata/index(verify)).
These may be understood as an on-disk variant of a hashmap: rebucketing lets the hashing scale well, and fixed-size buckets allow relatively efficient modification.
They are also libraries, so easy to embed and avoid network serving details.
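As a sketch of the idea, Python's stdlib dbm module wraps whichever dbm-family implementation is available (falling back to a pure-Python one), and behaves much like a persistent dict of bytes:

```python
import dbm

# Open (creating if necessary) a key-value file database.
# Which dbm-family backend you get depends on what is installed;
# dbm.dumb is the pure-Python fallback.
with dbm.open('example.db', 'c') as db:
    db[b'greeting'] = b'hello'      # keys and values are bytes
    db[b'count'] = b'3'
    print(db[b'greeting'])          # b'hello'

# Reopening later sees the persisted data.
with dbm.open('example.db', 'r') as db:
    assert db[b'count'] == b'3'
```

Note that individual items can be added, changed, and deleted without rewriting the whole file, which is exactly the practicality mentioned above.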
dbm family
Within the dbm family, the ones that are interesting enough to use are probably roughly:
- dbm, 'database manager' (1979, AT&T)
- ndbm, 'new database manager' (1986, Berkeley), added support for the library to deal with multiple open databases
- sdbm, a 1987 clone of ndbm, made for licensing reasons
- gdbm, 'GNU database manager', also added arbitrary-length data
- Berkeley DB (a.k.a. BDB and sometimes BSD DB), optional concurrent access, transactions and such (ACIDity), etc.
While not part of the same family, there are also continuations / successors of this idea, including
- Tokyo Cabinet and related are basically a modern reimplementation of the dbm idea
- tdb (trivial database library): made for samba suite (so relatively recent), API like GDBM but safely allows concurrent writers [1]
- tdbm: variant of ndbm with atomic transactions, in-memory databases
- not to be confused with the ORM called "the database machine" and also abbreviating to tdbm
- MDBM: memory-mapped key-value database store, derived from the same lineage as sdbm and ndbm [2]
- QDBM [3]
Berkeley DB notes
Berkeley DB, also known as BDB and libdb, is basically a key-value map in a file.
It is a library instead of a server, so can be embedded, and is used like that quite a bit.
For simpler (e.g. not-very-relational) ends it has lower and more predictable overhead than bulkier databases.
Technical notes
There is a low-level interface that does not support concurrent write access from multiple processes.
It also has some higher-level provisions for locking, transactions, logging, and such, but you have to choose to use them, and then may want to specify whether it's to be safe between threads and/or processes, and other such details.
From different processes, you would probably use DBEnv(verify) to get BDB to use proper exclusion. Most features you have to explicitly ask for via options - see things like DBEnv::open() (e.g. DB_THREAD, (lack of) DB_PRIVATE), and also notes on shared memory regions.
Interesting aspects/features:
- it being a library means it runs in the app's address space, minimizing cross-process copying and required context switches
- caching in shared memory
- option for mmapped read-only access (without the cache)
- option to keep database in memory rather than on disk
- concurrent access:
- writeahead logging or MVCC
- locking (fairly fine-grained)
- transactions (ACID), and recovery
- hot backups
- Distribution:
- replication
- commits to multiple stores (XA interface), (since 2.5)
Both key and value are byte arrays; the application has to decide how it wishes to format and use data.
- Both key and value can be up to 2^32 bytes (4GB, though for keys that's usually not a great idea).
- A database file can be up to 2^48 bytes (256TB, which is more than various current filesystem limits).
It uses a cache to avoid lookup slowness, and a write-back cache to be more write-efficient.
Format/access types
There are multiple types of access / file format. They provide mostly the same functionality (keyed access as well as iteration over the set); the difference is mostly in performance, and only matters when the data is large, since if all data fits in the cache this is a near-non-issue.
For larger data sets you should consider how each type fits the way you access your data.
If your keys do not order the entries, you should consider hash or btree. When keys are ordered record numbers, you should probably go with recno, a.k.a. record, (fixed or variable-length records).
You can supply your own comparison and hash functions.
More details:
- Hash (DB_HASH)
- uses extended linear hashing; scales well and keeps minimal metadata
- supports insert and delete by key equality
- allows iteration, but in arbitrary order
- B+tree (DB_BTREE)
- ordered by keys (according to the comparison function defined at creation time. You can use this for access locality)
- allows lookup by range
- also keeps record numbers and allows access by them, but note that these change with changes in the tree, so are mostly useful for use by recno:
- recno (DB_RECNO)
- ordered records
- fast sequential access
- also with key-based random access - it is actually built on B+tree but generates keys internally
- queue (DB_QUEUE)
- fixed-size records
- fast sequential access
You can also open a BDB using DB_UNKNOWN, in which case the open call determines the type.
There are provisions to join databases on keys.
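A rough feel for the hash-vs-btree tradeoff, as a pure-Python sketch (not BDB's actual implementation): a hash layout gives fast equality lookups but no meaningful iteration order, while keeping keys sorted is what lets a btree answer range queries cheaply.

```python
import bisect

# Hash-style access: fast equality lookup, arbitrary iteration order.
hashed = {'cherry': 3, 'apple': 1, 'banana': 2}
assert hashed['banana'] == 2

# Btree-style access: keys kept sorted, so range scans are cheap.
keys = sorted(hashed)                      # ['apple', 'banana', 'cherry']

def range_scan(lo, hi):
    """Return keys in [lo, hi), as a btree cursor would walk them."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_left(keys, hi)
    return keys[i:j]

assert range_scan('a', 'c') == ['apple', 'banana']
```

This is also why the choice only matters for larger data: both shapes answer single-key lookups fine, and only the ordered layout supports ranges and access locality.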
Versions and licenses
(It's hard to find good details.)
Early versions came from a library for simple hashed records, and later B-tree records.
- BDB 1.85 was a version for BSD 4.4, released in 1992 under a BSD-like license (that makes it behave like the GPL)
A Netscape request for further features led to the creation of Sleepycat Software in 1996
- BDB 2.1 was released in 1997, adding concurrent access, transactions
Sleepycat Software was acquired by Oracle Corporation in 2006. This seems to have had few licensing consequences for versions since then (verify)
Versions 2 and later, including the Oracle additions, are dual-licensed, under either:
- The Sleepycat Commercial License: A purchased license that does not force redistribution
- The Sleepycat Public License: software that uses Berkeley DB and is distributed must also have its source distributed (note: excludes in-house development)
- apparently, language support/wrapping is a special case (not unlike the LGPL): for e.g. the Python and Perl interfaces, the software that uses BDB is the language binding itself, not the scripts that use that interface. This does not seem to apply in the case of C library use(verify).
In other words, you generally have three options:
- use freely while the application is and stays entirely personal / internal to your business
- you have to distribute the source of the application that uses BDB
- you have to get a paid license
There are now three products:
- Berkeley DB (the basic C library)
- Berkeley DB Java, a Java version with options for object persistence
- Berkeley DB XML, an XML database using XQuery/XPath (see e.g. [4])
Some additional licenses apply to the latter.
See also
- http://en.wikipedia.org/wiki/Sleepycat_license
- http://www.oracle.com/technology/software/products/berkeley-db/htdocs/licensing.html
Tokyo Cabinet / Kyoto Cabinet
database library, so meant for single-process use.
Tokyo Cabinet (2007) (written in C) is an embedded key-value database, a successor to QDBM
- on-disk B+ trees, hash tables, or fixed-length array
- multithreaded
- some transaction support
- no real concurrent use
- process-safe via exclusion control (via file locking), but only one writer can be connected at a time
- threadsafe (meaning what exactly?)
Kyoto Cabinet (2009) is intended to be a successor.
Written in C++; the code is simpler than Tokyo's, and it intends to work better with threads. (Single-threaded use seems a little slower than Tokyo)
Comparison:
- Tokyo may be a little faster
- Tokyo may be a little more stable (at least in the earlier days of Kyoto's development)
- Kyoto may be simpler to use
- Kyoto may be simpler to install
LightningDB / LMDB
Lightning Memory-Mapped Database, a.k.a. LightningDB (and MDB before a rename)
- Ordered-map store
- ACID via MVCC
- concurrency
https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
cdb
Also: some new things
e.g. LevelDB and RocksDB, on this page filed under kv store
Relational file databases
SQLite
SQLite is a database engine, based on files accessed via a library, rather than a service.
Below is some introduction. For more specific/technical use notes, see SQLite notes
No server: no admin rights required to install or run it, no network port to use, no firewall to (not) configure for it.
A database is a single file (with some helper files that appear temporarily but you typically don't have to think about), and there is no configuration (...beyond things you do per-file-database, some of which are stored inside, some of which you have to be consistent about).
It's not optimized to be multi-client. Functionally you can get that to some degree, but if you do alterations, don't expect it to scale well.
It's a little fancier than various other library-only databases, in that it e.g. supports
- recovery via journaling
- concurrent access (...to a degree)
- most of SQL92
- ACIDity
- constraints
- indices
- views (to a degree)
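A minimal sketch of typical use, via Python's stdlib sqlite3 module (file and table names are arbitrary):

```python
import sqlite3

conn = sqlite3.connect('notes.sqlite3')   # the database is just this file
conn.execute('CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS notes_body ON notes(body)')

# Parameterized inserts, then commit to persist.
conn.execute('INSERT INTO notes (body) VALUES (?)', ('remember the milk',))
conn.commit()

for row in conn.execute('SELECT id, body FROM notes'):
    print(row)                            # (1, 'remember the milk')
conn.close()
```

No service to start, no credentials: any process with read access to the file can query it the same way.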
Cases where SQLite is interesting:
- embedded things - SQLite and BDB have both been fairly common here for a good while
- giving an application memory between runs
- without reinventing various wheels
- interchange of nontrivial data between different programs / languages
- (that doesn't need to be minimum-latency)
- when accessed via a generic database interface, you can fairly easily switch between sqlite and a real rdbms
- e.g. during development it's useful to do tests with an sqlite backend rather than a full installed database
- storage for simple dynamic websites (without needing a database, though on shared hosting this is usually just there anyway)
- particularly if mostly read-only
- Creative uses, such as local caching of data from a loaded central database
- whether persistent
- or not - you can have memory-only databases
Limitations, real or perceived:
- for concurrency and ACIDity from multiple clients, it requires file locking
- particularly network mounts tend not to implement that fully (some filesystems skimp too)
- while sqlite will function with multiple users, assume it will perform better with fewer users (or only simple interactions)
- no foreign keys before 3.6.19 (can be worked around with triggers) and they're still turned off by default
- no VIEW writing (verify)
- triggers are somewhat limited (verify)
Unlike larger database engines
- the less-common RIGHT JOIN and FULL OUTER JOIN are unimplemented (verify) (but they're not used much, and you can rewrite queries)
- no permission system (...it'd be fairly pointless if you can read the file anyway)
- it is dynamically typed, meaning the type can vary per row, regardless of column type
- This is probably not your initial expectation (certainly unlike most RDBMS), and you need to know what it does.
- this is sometimes convenient, and sometimes a weird reason you need extra wrangling
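The dynamic typing is easy to demonstrate from Python: the declared column type is only an affinity, and each stored value keeps its own type (the resulting types below follow SQLite's affinity rules).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (v INTEGER)')    # INTEGER is only an affinity
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("INSERT INTO t VALUES ('one')")  # a non-numeric string goes in as text
conn.execute("INSERT INTO t VALUES (1.5)")    # can't be losslessly an int, stays real

print([type(v) for (v,) in conn.execute('SELECT v FROM t')])
# [<class 'int'>, <class 'str'>, <class 'float'>]
```

A strictly-typed RDBMS would reject the second insert; SQLite happily stores it, which is exactly the behaviour you need to be aware of.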
Like larger database engines
- How well it performs depends on some choices
- Autocommit (often the default) is good for concurrency, bad for performance (and sqlite libraries may add their own behaviour)
- for larger things you want well-chosen indices
- some cases of ALTER TABLE basically imply creating a new table
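The autocommit point in particular is worth a sketch: grouping many inserts into one transaction avoids per-statement commit overhead (on disk, that means one sync instead of thousands; exact behaviour also depends on your wrapper library's settings).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE samples (n INTEGER)')

# One transaction around many inserts, instead of autocommitting each one.
with conn:   # commits on success, rolls back on exception
    conn.executemany('INSERT INTO samples VALUES (?)',
                     ((i,) for i in range(10000)))

assert conn.execute('SELECT COUNT(*) FROM samples').fetchone()[0] == 10000
```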
Array oriented file databases
CDF
CDF (Common Data Form) is a primarily array-oriented format, which seems designed for working with and archiving large amounts of sensor data: self-described, appendable, scalable enough, with one writer and multiple readers.
...so bears some similarities to HDF5 in intent, and is found in similar contexts.
In fact, NetCDF4 chose HDF5 as a storage layer under the covers,
so libraries could easily deal with both CDF and HDF5 (though CDF data representation, especially classic, is more restricted in practical use).
In different contexts it may refer to an abstract data model, an API to access arrays, a data format, or a particular implementation of all of that.
This is a little confusing, in that some of those have been pretty constant, and others have not. In particular, the data format for netCDF can be roughly split into
- classic (since 1989)
- classic with 64-bit offsets (since 2004)
- netCDF-4/HDF5 classic (since 2008)
- netCDF-4/HDF5 'enhanced' (since 2008)
The first three are largely interchangeable at API level, while the last allows more complex data representations that cannot be stored in classic.
Hierarchical Data Format (HDF)
HDF describes a data model, library, and file format.
Now typically meaning HDF5 (though HDF4 still sees some use).
Goals/features:
- hierarchical refers to the fact that addressing-wise, it basically implements filesystem-like names within it
- Stores various array-like stuff, and halfway clever about readout of parts from huge datasets.
- structured, potentially complex data
- primarily cases where you have many items following the same structure, and often numerical data (but can store others)
- think time series, large sets of 2D or 3D imagery (think also meteorology), raster images, tables, other n-dimensional arrays
- fast lookup on predictable items
- offset lookups in arrays, B-tree indexes where necessary; consideration for strides, random access, and more.
- a little forethought helps, though
- portable data (a settled binary format, and self-described files)
- dealing with ongoing access to possibly-growing datasets
- parallel IO has been considered
- multiple applications accessing a dataset, parallelizing IO accesses; allows its use on clustered filesystems
See also
- http://en.wikipedia.org/wiki/HDF5
- http://www.hdfgroup.org/
- https://support.hdfgroup.org/HDF5/Tutor/HDF5Intro.pdf
Apache Arrow
Arrow is primarily an abstraction for working with certain types of data.
Arrow is not a database, in the sense that you cannot assume that something that implements it lets you alter data efficiently.
Yet Arrow does propose to efficiently serialize data and pass it around, e.g.
- thinking about keeping things relatively light in-memory
- zero-copy where shared memory is possible (which is far from all cases),
- prefers to stream in chunks if/where copies are necessary
- fast to iterate and
- preferably near-O(1) for random access - though this does not hold (at all) for various of its serialisations
- think about parallel IO
It prefers to work with column-style data and lends itself to table-like things (though technically it is more like hybrid columnar), and there's no reason it won't store other types of data.
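The columnar idea itself is simple to sketch in plain Python: store each column as its own contiguous buffer, so scanning one field doesn't touch the others. (Arrow's actual layout adds validity bitmaps, chunking, and a richer type system on top of this.)

```python
from array import array

# Row-oriented: each record is a tuple; scanning one field touches all of them.
rows = [(1, 'a', 0.5), (2, 'b', 1.5), (3, 'c', 2.5)]

# Column-oriented: one contiguous, typed buffer per field.
ids    = array('q', [1, 2, 3])        # 64-bit ints, packed
labels = ['a', 'b', 'c']
scores = array('d', [0.5, 1.5, 2.5])  # 64-bit floats, packed

# Summing one column only reads that column's buffer.
assert sum(scores) == 4.5
```

The packed buffers are also what makes zero-copy sharing plausible: another process mapping the same memory sees valid data without any deserialization step.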
Arrow is not a single serialization/interchange/storage format.
...there are multiple, and they have distinct properties. And yes, this makes "arrow file" confusingly ambiguous.
- The IPC/feather format (uncompressed, more about faster interchange)
- in streaming style OR
- in random access style
- optionally memory mapped (and can be by multiple processes)
- Parquet (compressed)
- more about storage-efficient archiving
- much less about efficient random access
- (you can also load more classical formats like CSV and JSONL but lose most of the efficient lazy-loading features)
There is also pyarrow.dataset which
- read or write partitioned sets of files
- can selectively load columns
- and filter by values
- Can load from S3 (and MinIO) and HDFS
https://en.wikipedia.org/wiki/Apache_Arrow
Unsorted (file databases)
Other/similar include:
Unsorted
Tokyo (and Kyoto)
Tokyo Tyrant / Kyoto Tycoon
database server, so useful when you want some concurrent use.
Supports expiry, so can act much like memcached
Tokyo Dystopia
full-text search system
http://fallabs.com/tokyodystopia/
Tokyo Promenade
content management system built around cabinet.
Presentable in BBS, blog, and Wiki style
http://fallabs.com/tokyopromenade/
HamsterDB
https://github.com/GerHobbelt/hamsterdb
Which do I use?
Storagey stuff - kv or document stores
Ask for a key, get a blob of uninterpreted data, or map of somewhat structured data.
MongoDB notes
tl;dr
- Weakly typed, document-oriented store
- values can be lists
- values can be embedded documents (maps)
- searchable
- on its fields, dynamic
- e.g. db.collname.find({author:"mike"}).sort({date:-1}).limit(10) - which together specifies a single query operation (e.g. it always sorts before limiting [7])
- supportable with indices [8]
- field indexes - basic index
- compound indexes - indexes a combination (e.g. first looking for a userid, then something per-userid)
- multikey indexes - allows matching by one of the values for a field
- 2d geospatial - 'within radius', basically
- text search
- indexes can be:
- hash index - equality only, rather than the default sorted index (note: doesn't work on multi-key)
- partial index - only index documents matching a filter
- sparse index - only index documents that have the field
- sharding, replication, and combination
- replication is like master/slave w/failover, plus when the primary leaves a new primary gets elected. If it comes back it becomes a secondary to the new primary.
- attach binary blobs
- exact handling depends on your driver[9]
- note: for storage of files that may be over 16MB, consider GridFS
- sort of like JSON, but binary, and has some extra things (like a date type)
- Not the fastest NoSQL variant in a bare-metal sense
- but often a good functionality/scalability tradeoff for queries that are a little more complex than just key-value
- no transactions, but there are e.g. atomic update modifiers ("update this bunch of things at once")
- 2D geo indexing
- GridFS: chunking large files and actually having them backed by mongo.
- point being you can get a distributed filesystem
- mongo shell interface is javascript
Schema considerations
- One to one relationships? You probably want it in the same document. For most access patterns it saves extra lookups.
- E.g. a user's address
- One to many? Similar to one-to-one: often still embedded, now as a list.
- E.g. a user's address when they have more than one.
- Always think about typical accesses and typical changes.
- For example, moving an entire family may go wrong because values have to change in many places. (but then, it often might in RDBMS too because a foreign key would have to change in many places)
- foreign-key-like references can be in both places, because values can be lists and queries can search in them
- Usually, avoid setups where these lists will keep growing.
- when you refer to other documents where contents will not change, you could duplicate that part if useful for e.g. brief displays, so you can do those without an extra lookup.
- sometimes such denormalized information can actually be a good thing (for data-model sanity)
- e.g. the document for an invoice can list the exact text it had, plus references made. E.g. updating a person's address will not change the invoice -- but you can always resolve the reference and note that the address has since changed.
- if you know your product will evolve in production, add a version attribute (and add application logic that knows how to augment previous versions to the current one)
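The denormalized-invoice point, shown as plain document shapes (the field names here are hypothetical, and this is just the data model, not pymongo code):

```python
# A customer document, and an invoice that embeds a snapshot of the
# address as it was billed, plus a foreign-key-like reference.
customer = {'_id': 'cust1', 'name': 'Mike', 'address': '1 Old Street'}

invoice = {
    '_id': 'inv7',
    'customer_id': 'cust1',            # reference: can still resolve current data
    'billed_address': '1 Old Street',  # snapshot: frozen at invoice time
    'total': 99.95,
}

# Moving house touches only the customer document...
customer['address'] = '12 New Street'

# ...while the invoice keeps saying what it actually said.
assert invoice['billed_address'] == '1 Old Street'
```

Resolving `customer_id` afterwards is how you'd notice the address has since changed, which is the data-model sanity the note above is getting at.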
Also there's a document size limit (16MB per BSON document).
You can set _id yourself,
but if you don't it'll get a unique, GUID-like identifier.
See also:
GUI-ish browsers:
https://www.mongodb.com/products/compass
riak notes
Key-value store with a focus on concurrency and fault tolerance.
Pluggable backends, e.g. allowing use as just a memcache, or give it persistence.
Eventually consistent (with some strong-consistency experiments?(verify))
masterless
Ideally you fix the cluster size ahead of time. When you add nodes, contents are redistributed (verify)
Backends include
- bitcask
- all keys in hashtable in RAM (fast, but limiting the amount of items via available RAM)
- file copy = hot backup (verify)
- leveldb
- keys stored on-disk
- secondary indexes, so limited relational-style querying at decent performance
- data compression
- no hot backup
- innostore
- memory
- objects in ram
It aims to distribute perfectly (and supports some other features by assumptions), which implies you have to fix your cluster size ahead of time
Pluggable backends mean you can have it persist (default) or effectively be a distributed memcache
etcd
Key-value store, distributed by using Raft consensus.
It is arguably a Kubernetes / Google thing now and has lessening general value,
in part due to requiring gRPC and HTTP/2, and threatening to abandon its existing API.
CouchDB notes
(not to be confused with couchbase)
Document store.
Made to be compatible with memcachedb(verify), but with persistence.
- structured documents (schemaless)
- can attach binary blobs to documents
- RESTful HTTP/JSON API (to write, query)
- so you could do with little or no middle-end (you'll need some client-side rendering)
- shards its data
- eventually consistent
- ACIDity per document operation (not larger, so awkward for inherently relational data)
- no foreign keys, no transactions
- running map-reduce on your data
- Views
- best fit for mapreduce tasks
- Replication
- because it's distributed, it's an eventually consistent thing - you have no guarantee of delivery, update order, or timeliness
- which is nice for merging updates made remotely/offline (e.g. useful for mobile things)
- and don't use it as a message queue, or other things where you want these guarantees
- revisions
- for acidity and conflict resolution, not in a store-forever way.
- An update will conflict if someone did an update based on the same version -- as it should.
- Couchapps,
document ~= row
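The revision behaviour above amounts to optimistic concurrency: every update must name the revision it was based on, and an update based on a stale revision is rejected as a conflict. A pure-Python illustration of the idea (not the CouchDB API):

```python
class ConflictError(Exception):
    pass

store = {}   # docid -> (rev, doc)

def put(docid, doc, based_on_rev=None):
    """Store doc, but only if the caller saw the current revision."""
    cur = store.get(docid)
    cur_rev = cur[0] if cur else None
    if based_on_rev != cur_rev:
        raise ConflictError('document was updated by someone else')
    new_rev = (cur_rev or 0) + 1
    store[docid] = (new_rev, doc)
    return new_rev

rev1 = put('doc1', {'n': 1})                  # create
rev2 = put('doc1', {'n': 2}, based_on_rev=rev1)
try:
    put('doc1', {'n': 3}, based_on_rev=rev1)  # based on a stale revision
except ConflictError:
    print('conflict, as it should be')
```

CouchDB's revision strings carry more information than a counter, but the acceptance rule is the same shape: conflicts are detected, not silently overwritten.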
Notes:
- view group = process
- nice way to scale
- sharding is a bit harder
Attachments
- not in views
- if large, consider CDNs, a simpler nosql key-val store, etc.
See also:
- http://couchdb.apache.org/
- http://couchdb.apache.org/docs/overview.html
- http://couchdb.apache.org/docs/intro.html
PouchDB
Javascript analogue to CouchDB.
Made in part to allow storage in the browser while offline, and push it to CouchDB later, with minimal translation.
Couchbase notes
(previously known as Membase) (not to be confused with CouchDB)
CouchDB-like document store, plus a memcached-compatible interface
Differences to CouchDB include:
- typing
- optional immediate consistency for individual operations
- allows LDAP auth
- declarative query language
- stronger consistency design
TiKV
Distributed key-value with optional transactional API
See also TiDB, which is basically an SQL layer on top that makes it NewSQL.
FoundationDB
Mixed-model but seems built on (ordered) kv.
Networked, can be clustered (replication, partitioning).
Does transactions with serializable isolation, so actually aims for ACIDity.
Tries to avoid some issues by not allowing transactions to run for more than a few seconds or exceed 10MB of writes, and other limitations like that.
https://en.wikipedia.org/wiki/FoundationDB
LevelDB
On-disk key-value store
License: New BSD
https://en.wikipedia.org/wiki/LevelDB
https://github.com/google/leveldb
RocksDB
A Facebook fork of LevelDB, focusing on some specific performance details
https://en.wikipedia.org/wiki/RocksDB
MonetDB
Column store
https://www.monetdb.org/Documentation/Manuals/MonetDB/Architecture
hyperdex
key-value or document store.
stronger consistency guarantees than various other things
supports transactions on more than a single object - making it more ACID-style than most any nosql. By default it is looser, for speed(verify)
very interesting performance
optionally applies a schema
Most parts are BSD-licensed. The Warp add-on, which provides the fault-tolerant transactional part, is commercially licensed.
The evaluation variant you get omits the fault-tolerance guarantee part.
RethinkDB
Document store with a push mechanism, to allow easier/better real-timeness than continuous polling/querying.
Storagey stuff - wide-column style
Wide column is arguably just a specialized use of a kv store, but one that deserves its own name because implementations are specifically optimized for that use.
Cassandra
Distributed wide-column store
Uses CQL for its API (historically that was Thrift)
Seems to scale better than others, though apparently at the cost of general latency
masterless
People report (cross-datacenter) replication works nicely.
Values availability, with consistency being somewhat secondary (verify)
(if consistency is more important than availability, look at HBase instead)
See also:
http://basho.com/posts/technical/riak-vs-cassandra/
Hbase
An implementation imitating Google's Bigtable, part of Hadoop family (and built on top of HDFS).
See also:
- http://labs.google.com/papers/bigtable.html for the article on Bigtable
- http://wiki.apache.org/hadoop/PerformanceTuning
- http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
- stored on a per-column family basis
Hypertable
See also:
Accumulo
Storagey stuff - graph style
This one is mostly about the way you model your data, and the operations you can do, and do with fair efficiency.
...in that you can use e.g. key-value stores in graph-like ways, and when you don't use the fancier features, the two may functionally be hard to tell apart.
ArangoDB
Apache Giraph
Neo4j
OrientDB
FlockDB
AllegroGraph
http://franz.com/agraph/allegrograph/
GraphBase
GraphPack
Cachey stuff
redis notes
tl;dr
- key-value store
- with things like counting
- typed - types/structures include: counter, list, set, sorted set (hash-based), hash, bitarray (and some geo data types)
- that you get a basic set of queries/operations for
- allows transactions, which lets you do your own atomic updates when necessary
- pub/sub [11]
- (completely separate from the keyspace)
- can be sharded [12]
- shard mapping is in-client -- so must be consistent if you use it as a store (rather than a cache)
- https://www.tutorialspoint.com/redis/redis_partitioning.htm
- primarily in-memory
- though allows persisting by
- logging requests for recovery that way (AOF), and/or
- snapshotting current state, e.g. every hour (RDB)
- ...but both are still more of a persistent cache than a primary store
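The in-client shard mapping mentioned above is typically just a deterministic hash of the key; every client must use the exact same function or they'll disagree about where data lives. A sketch (hypothetical host names; real clients often use consistent hashing so that resizing the pool moves fewer keys):

```python
import hashlib

shards = ['redis-a:6379', 'redis-b:6379', 'redis-c:6379']   # hypothetical hosts

def shard_for(key: str) -> str:
    """Map a key to a shard; all clients must agree on this function."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

assert shard_for('session:42') == shard_for('session:42')   # deterministic
```

This also shows why sharding is riskier for a store than for a cache: change `shards` and most keys map elsewhere, which merely causes misses in a cache but loses data in a store.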
Uses
- a memcache of sorts (session data, common resources)
- intermediate between services (see pub/sub)
- rate limiting between services
On commands:
- most/all types
- GET, SET
- MGET, MSET (multiple keys in one call)
- SETNX - set only if did not exist
- SETEX - set value and expiration
- MSETNX
- GETSET
- DEL
- EXISTS
- TYPE
- EXPIRE, TTL, PERSIST
- http://redis.io/commands#generic
- http://redis.io/commands#keys
- integer / counter
- GET, MGET
- INCR
- DECR
- INCRBY
- DECRBY
- float value
- INCRBYFLOAT
- bytestrings
- STRLEN
- APPEND
- GETRANGE (substring)
- SETRANGE
- bitstrings
- GETBIT
- SETBIT
- BITCOUNT
- BITOP
- BITPOS
- list
- LPUSH, RPUSH
- LPOP, RPOP
- LRANGE (fetch by index range)
- LTRIM (keep only a given index range, discarding the rest)
- BLPOP, BRPOP - block until present (until timeout, waiting clients served in order, can wait on multiple lists -- all behaviour useful for things like producer-consumer pattern)
- RPOPLPUSH, BRPOPLPUSH - may be more appropriate when building queues
- http://redis.io/commands#list
- hash
- HSET, HGET
- HMSET, HMGET (multiple)
- http://redis.io/commands#hash
- transactions:
- MULTI, ended by either EXEC or DISCARD
- WATCH, UNWATCH
- http://redis.io/commands#transactions
memcached
memcached is a memory-only key-value store
- evicts items
- by least recent use when the memory limit is hit (LRU style, geared to keep most-used data)
- by explicit eviction time, if given
- client can shard to multiple backends - an approach that has some footnotes
- no persistence, no locks, no complex operations (like wildcard queries) - which helps it guarantee low latency
It is not:
- storage. It's specifically not backed by disk
- a document store. Keys are limited to 250 characters and values to 1MB
- if you want larger, look at other key-value stores (and at distributed filesystems)
- redundant
- You are probably looking for a distributed filesystem if you are expecting that. (you can look at memcachedb and MogileFS, and there are many others)
- a caching database proxy
- you have to do the work of figuring out what to cache
- you have to do the work of figuring out dependencies
- you have to do the work of figuring out invalidations
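That "you do the work" list is the cache-aside pattern. With a plain dict standing in for the memcached client (its get/set/delete mirror memcached's basic operations):

```python
cache = {}   # stands in for a memcached client

def load_user_from_db(user_id):
    # Hypothetical slow database lookup.
    return {'id': user_id, 'name': 'mike'}

def get_user(user_id):
    key = f'user:{user_id}'
    user = cache.get(key)                  # 1. you decide what to look up
    if user is None:
        user = load_user_from_db(user_id)  # 2. on a miss, you hit the real DB
        cache[key] = user                  # 3. you populate for next time
    return user

def update_user(user_id, user):
    # ...write to the real database here...
    cache.pop(f'user:{user_id}', None)     # 4. you invalidate yourself

assert get_user(42) == {'id': 42, 'name': 'mike'}
```

memcached itself does none of this reasoning; it just stores what you tell it, which is what keeps it fast.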
Originally developed for livejournal (by Danga Interactive) and released under a BSD-style license.
See also
Searchy stuff
ElasticSearch
See also Lucene and things that wrap it#ElasticSearch
Note that it is easy to think of Lucene/Solr/ES as only a text search engine.
But it is implemented as a document store, with basic indices that act as selectors.
So it is entirely possible to do more structured storage (and fairly schemaless, when ES guesses the type of new fields),
with some arbitrary selectors, which makes it potentially quite useful for analytics and monitoring.
And since it adds sharding, it scales pretty decently.
Sure, it won't be as good at CRUD operations, so it's not your primary database, but it can work great as a fast structured queryable thing.
Time seriesy
Time series databases are often used to show near-realtime graphs of things that happen, while also being archives.
They are aimed at being efficient at range queries,
and often have functionality that helps summarize or downsample over time windows.
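The range-query emphasis can be illustrated simply: with points kept ordered by timestamp, a time window is found by binary search rather than a scan. (A sketch only; real TSDBs add compression, downsampling, and time-bucketed storage on top.)

```python
import bisect

# (timestamp, value) points, kept sorted by timestamp as they arrive.
points = [(10, 0.1), (20, 0.4), (30, 0.2), (40, 0.9), (50, 0.3)]
times = [t for t, _ in points]

def window(t_start, t_end):
    """All points with t_start <= timestamp < t_end."""
    i = bisect.bisect_left(times, t_start)
    j = bisect.bisect_left(times, t_end)
    return points[i:j]

assert window(20, 45) == [(20, 0.4), (30, 0.2), (40, 0.9)]
```

Append-mostly, timestamp-ordered writes are what make this layout cheap to maintain, which is why TSDBs can serve near-realtime graphs and act as archives at the same time.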
InfluxDB
InfluxDB is primarily meant as a time series database, though can be abused for other things.
See also Influxdb notes
Is now part of a larger stack:
- InfluxDB[13] - time series database
- Telegraf[14] - agent used to ease collecting metrics. Some pluggable input / aggregation/ processing things
- Kapacitor[15] - streams/batch processing on the server side
- Chronograf[16] - dashboard.
- also some interface to Kapacitor, e.g. for alerts
- Often compared to Grafana. Initially simpler than that, but more similar now
Flux[17] refers to a query language used in some places.
InfluxDB can be distributed, and uses distributed consensus to stay synced.
Open-source, though some features (like distribution) are enterprise-only.
TimescaleDB
Tied to Postgres
OpenTSDB
Graphite
See Data_logging_and_graphing#Graphite_notes
Storagey stuff - NewSQL
NewSQL usually points at distributed variants of SQL-accessed databases.
So things that act like RDBMSes in the sense of
- offering SQL features (not just a SQL-looking query language),
- providing transactions,
- and being as ACID as possible while also providing some amount of scaling.
NewSQL is in some ways an answer to NoSQL that says "you know we can have some scaling without giving up all the features, right?"
YugabyteDB
- distributed SQL
- PostgreSQL API
- Cassandra-like API
- https://en.wikipedia.org/wiki/YugabyteDB
TiDB
- MySQL interface
- tries to be OLTP and OLAP
- handles DDL better?
- https://en.wikipedia.org/wiki/TiDB
- inspired by Google Spanner
- see also TiKV - TiDB is basically the SQL layer on top of TiKV (verify)
CockroachDB
VoltDB
https://en.wikipedia.org/wiki/VoltDB
Unsorted (NewSQL)
Aurora (Amazon-hosted only)
- MySQL and PostgreSQL API to something a little more scalable than those by themselves
- https://en.wikipedia.org/wiki/Amazon_Aurora
Spanner
- basically BigTable's successor
- https://en.wikipedia.org/wiki/Spanner_(database)
H-Store
See also: