Broad qualifiers, a.k.a. thinking about what you want from your database


OLTP, OLAP

tl;dr:

  • OLTP basically means systems with immediacy, OLAP often means bulk/batch processing
  • Systems can be made to do both to some degree, but OLTP often comes with people wanting it to be as immediate as possible
  • RDBMSes are mostly used for OLTP; NoSQL has less opinion about how it's used


You'll see these two as qualifiers, as kinds of setups, and sometimes as specializations of specific kinds of systems/databases.


Online transaction processing (OLTP) refers to any system that does day-to-day transactions

  • handles individual requests as they come in
  • as safely/serially as configured
  • as realtime as that still allows

This usually puts focus on

  • operations being simple and fast
    because if things take longer than milliseconds, it's going to scale very poorly, particularly if done serially
  • consistency within systems - strongly where this is simple
  • consistency between systems - where applicable and to the degree this is possible
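
To make that concrete, here is a minimal sketch of the OLTP pattern - one short transaction per request, all-or-nothing - in Python with sqlite3. The accounts table, names, and amounts are made up for illustration:

 import sqlite3

 # One small, fast transaction per request: either both updates happen, or neither does.
 conn = sqlite3.connect("shop.db", isolation_level=None)  # autocommit; we issue BEGIN/COMMIT ourselves
 conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
 conn.execute("INSERT OR IGNORE INTO accounts VALUES (1, 100), (2, 0)")

 def transfer(src, dst, amount):
     try:
         conn.execute("BEGIN")
         cur = conn.execute(
             "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
             (amount, src, amount))
         if cur.rowcount != 1:
             raise ValueError("insufficient funds or no such account")
         conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
         conn.execute("COMMIT")
     except Exception:
         conn.execute("ROLLBACK")
         raise

 transfer(1, 2, 25)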


OLTP that uses more complex data models tends to be implemented in RDBMSes, because the consistency you then often want can be handled for you. When you want scalability, you often try to loosen that as much as you can get away with.


Online analytical processing (OLAP) tends to refer to "analyse a bulk of data" setups.

This often means, mostly unlike OLTP:

  • collecting chunks of data over a larger amount of time
  • often done for business reasons, rather than direct use by users
  • classically more about generating reports, analytics, etc. to be looked at later - there are few cases where a response coming seconds later is any problem (or days, if we're talking about reports)
  • now more frequently also to be looked at live (on the scale of seconds), which is starting to overlap with more complex variants of health monitoring

(compare each to OLTP, where responses are measured in milliseconds)
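
As a hedged illustration of the shape of such work - one query that scans a large span of history and aggregates it, rather than touching individual rows. The orders table and its columns are hypothetical (created empty here just so the snippet runs):

 import sqlite3

 conn = sqlite3.connect("analytics.db")
 conn.execute("CREATE TABLE IF NOT EXISTS orders"
              " (id INTEGER PRIMARY KEY, ordered_at TEXT, total_cents INTEGER)")

 # One bulk aggregation over a year of history, rather than many point lookups.
 for month, n_orders, revenue in conn.execute("""
         SELECT strftime('%Y-%m', ordered_at) AS month,
                COUNT(*)                      AS orders,
                SUM(total_cents) / 100.0      AS revenue
         FROM orders
         WHERE ordered_at >= date('now', '-1 year')
         GROUP BY month
         ORDER BY month"""):
     print(month, n_orders, revenue)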


On mixing the two

When doing OLAP, you could just send batch-like queries to your one system that does everything, but when that system is a critical, live OLTP system, the bulk nature of OLAP style work could slow down OLTP due to its "let's be careful about reading when something is writing" nature.

As such, there is a habit of siphoning off data from the database used for OLTP into a separate database used for OLAP (with its own processing host).

There is now also Hybrid transactional/analytical processing (HTAP), the attempt to do both a transactional database and analytics in one system, which sometimes amounts to an automated variant of doing just that.
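
A minimal sketch of that siphoning habit, assuming SQLite on both ends and the same made-up orders table as above; the high-water-mark scheme (copy everything past the highest id we already have) is one simple way to make each run incremental:

 import sqlite3

 # Copy new rows from the live OLTP database into a separate analytics
 # database, so that bulky OLAP queries never touch the transactional one.
 oltp = sqlite3.connect("shop.db")
 olap = sqlite3.connect("analytics.db")
 for db in (oltp, olap):
     db.execute("CREATE TABLE IF NOT EXISTS orders"
                " (id INTEGER PRIMARY KEY, ordered_at TEXT, total_cents INTEGER)")

 # How far did we get last time?
 last_id = olap.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
 rows = oltp.execute("SELECT id, ordered_at, total_cents FROM orders WHERE id > ?",
                     (last_id,)).fetchall()
 olap.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
 olap.commit()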


The more at-your-leisure your OLAP work is, the more that classic ETL style processing covers your needs. These days, though, there are OLAP systems that aim to do real-ish time monitoring, and more explicitly balance scale and speed (also with NoSQL-ish scaling).

Such systems are often still factors slower than OLTP, or at least a few seconds behind (e.g. when running off a replicated database server),
yet still many factors faster than generic warehousing ("throw an ETL script at it once every so often", "produce a report" style things).


OLAP might be based on structured data from a central OLTP system you have, or it might ingest less-structured data (like logs) in which case terms like 'warehouse' or 'lake' turn up.



As a broad qualifier, OLTP and OLAP can describe

  • a system as a whole - goals, database, code, the hosts used
  • a type of database - e.g.
    • RDBMS is a fairly classical case of OLTP
    • most NoSQL (e.g. kv and document stores) is still largely on the OLTP side (if only because few do much processing themselves)
    • ElasticSearch used for log analysis or monitoring (see e.g. logstash, kibana) is more on the OLAP side
  • specific queries - e.g.
    • some complex queries may be OLAP
    • queries may get so complex that people do some pre-structuring


It also has some applications to architecture. OLTP often implies tiered applications, whereas OLAP modeling is much more varied and often somewhat custom (though OLAP NoSQL does exist), ranging from monolithic ETL batch scripts to complex interdependent streaming setups.

Lakes, warehouses, swamps, pits, and graveyards

Data warehouse means data in a processed, probably actively useful form.

This term has been around since the seventies.


Data lake means little more than "not a data warehouse".

It often means "we're tracking things, we'll see how it's useful later"

This frequently also means storing it in raw form, again because you don't know what you want to extract from it. (...though if the thing is by nature already a structured document, this is more about whether it's indexed than whether it's parsed)


In some cases, data lakes involve so little thinking ahead that

  • you don't know what you have
  • you will have a really tough time figuring how to use or combine or even find this data later
  • you created what you could creatively call
    • data swamps (inaccessible)
    • data pits (won't produce clear insights)
    • data graveyards (you're not using it, and you never will)


Notes:

  • searching for 'data lake' will mostly give you IoT related results - largely because IoT is very ill defined beyond "collect all the data (personal or not) and see what we can do with it later".


  • Data lake sometimes also includes the software that accesses the data - anything from ETL/ELT to sometimes much fancier analytics.
    More so if it's a company trying to sell you analytics software.


  • In practice, though,
    • data warehouse and data lake are often used interchangeably.
    • (so) since data warehousing can mean both, both ETL and ELT are seen in data warehousing


ETL, ELT?

ETL (Extract, Transform, Load)

is what data warehousing historically implies:
"I clean up my data before entering it into the system"
....in terms of data types, data validity, and change the shape of to fit a model/schema you have already settled, on at design stage
...often so that it can be stored in a structured database, meaning
you have specific goals in mind
so have already thought about efficient querying, and ensured faster loading
and you need to define that database up front
you can design it in a way that likely queries will be fast
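
A small sketch of that ETL shape, with a stand-in for a CSV export and a schema settled up front; the column names and date format are assumptions for illustration:

 import csv, io, sqlite3
 from datetime import datetime

 raw_csv = io.StringIO(   # stand-in for the Extract step (reading an export)
     "id,date,total\n"
     "1,31/01/2024,19.99\n"
     "2,01/02/2024,5.00\n")

 conn = sqlite3.connect("warehouse.db")
 conn.execute("""CREATE TABLE IF NOT EXISTS orders (
                     id INTEGER PRIMARY KEY,
                     ordered_at TEXT NOT NULL,    -- normalised to ISO 8601
                     total_cents INTEGER NOT NULL)""")

 def transform(row):
     """Validate one raw row and reshape it to fit the schema, or reject it."""
     ordered_at = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
     total_cents = round(float(row["total"]) * 100)
     if total_cents < 0:
         raise ValueError("negative total: %r" % (row,))
     return (int(row["id"]), ordered_at, total_cents)

 clean = [transform(row) for row in csv.DictReader(raw_csv)]                # Transform
 conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", clean)  # Load
 conn.commit()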



ELT (Extract, Load, Transform) mostly means

"just store what we got, in the form we got it in (probably a text blob, or even a binary blob), and worry about interpreting it later, maybe never"
Data lakes are arguably defined by this "it is not yet structured, queryable data" ELT approach (contrasted to data warehouses, which are more ETL'd/usable)

This is in part pragmatism:

  • if there's data we don't use yet, and some we may never use, why spend time creating a model and polishing data before we can even store it - we can do that work when we need it, if we ever do
  • it doesn't need to have the form that the original design assumptions require
  • we can capture any and all data, and might be able to ask more varied questions, and get more answers, if we keep the original data around.


However, the same thing that leads to these upsides (you didn't do anything yet) leads to downsides:

  • you will parse and validate it on use (see the sketch after this list)
    ...either on every use
      • this tends to be true for ad-hoc querying, which tends to mean mapreduce / pipeline solutions
      • this is usually also slower and/or consumes more power
    ...or on first use, e.g. if you validate-and-store into a working set for every question you work on
      • meaning you store even more data, and take more time (sort of because you're creating a warehouse for each use)
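
A sketch of that parse-on-use contrast, assuming hypothetical JSON events: the Load step keeps the uninterpreted blob, and the Transform step is paid for at query time, on every use:

 import json, sqlite3

 conn = sqlite3.connect("lake.db")
 conn.execute("CREATE TABLE IF NOT EXISTS raw_events (received_at TEXT, body TEXT)")

 def ingest(received_at, payload):
     # Load: no validation, no schema - just keep what arrived, as it arrived.
     conn.execute("INSERT INTO raw_events VALUES (?, ?)", (received_at, payload))
     conn.commit()

 def checkouts():
     # Transform-on-use: parse every blob, skip anything that doesn't decode
     # or doesn't match the question we're asking right now.
     for received_at, body in conn.execute("SELECT received_at, body FROM raw_events"):
         try:
             event = json.loads(body)
         except ValueError:
             continue
         if event.get("type") == "checkout":
             yield received_at, event

 ingest("2024-01-01T12:00:00", '{"type": "checkout", "total": 19.99}')
 print(list(checkouts()))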

