Broad qualifiers, a.k.a. thinking about what you want from your database



OLTP, OLAP

tl;dr:

  • OLTP basically means systems with immediacy, OLAP often means bulk/batch processing
  • Systems can be made to do both to some degree, but OLTP often comes with people wanting it to be as immediate as possible
  • RDBMSes are mostly used for OLTP, NoSQL has less opinion about how it's used


You'll see OLTP, OLAP as qualifiers.

sometimes as if they are opposites,
sometimes as if they are the only two kinds,
sometimes as specializations of specific kinds of systems/databases.

...all of which is a little overstated - but still useful.


Online transaction processing (OLTP)

  • tends to mean "handle individual requests as they come in" - probably day-to-day transactions
  • as safely/serially as configured
  • as realtime as that still allows

This usually puts focus on

  • each request/operation being simple and fast
    because if operations take longer than milliseconds, it's going to scale very poorly, particularly if done serially
  • consistency within systems - enforced strongly where that is simple
  • consistency between systems - where applicable, and to the degree it is possible
  • keeping accesses simple - which is sometimes served best by non-flat data structures
    those require some consistency management, which tends to be implemented in RDBMSes precisely so that you can offload a lot of that onto the database
When you want scalability you often try to loosen that as much as you can get away with.
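
As a minimal sketch of what "simple and fast, per request" looks like in practice - the table and column names here are made up for illustration, and SQLite stands in for whatever RDBMS you'd actually use:

```python
import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical OLTP database

# A typical OLTP operation: touch a handful of rows, by index,
# inside one short transaction, and return within milliseconds.
def place_order(user_id: int, product_id: int, qty: int) -> None:
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE stock SET amount = amount - ? "
            "WHERE product_id = ? AND amount >= ?",
            (qty, product_id, qty),
        )
        if cur.rowcount == 0:
            raise ValueError("out of stock")
        conn.execute(
            "INSERT INTO orders (user_id, product_id, qty) VALUES (?, ?, ?)",
            (user_id, product_id, qty),
        )
```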


Online analytical processing (OLAP)

  • tends to refer to "analyse a bulk of data" setups.
  • what makes this unlike OLTP is
    • it works on chunks of data collected over a longer span of time
    • it is often done to produce processing/summaries to be accessed later, not on request
      • classically this meant generating reports, analytics, etc. to be looked at later - where seeing them minutes to a day later is fine
      • there are faster variants, but none of them have the OLTP "responses are measured in milliseconds, largely because they are user-facing" angle
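
For contrast, a sketch of an OLAP-style question against the same hypothetical tables (a `created` timestamp column is assumed): one query that scans a large fraction of the data to produce a summary, where taking seconds or minutes is acceptable:

```python
import sqlite3

conn = sqlite3.connect("shop.db")  # same made-up data as above

# An OLAP-style question: scan months of rows, aggregate, summarize.
# Nobody expects this in milliseconds; it might run as a nightly report.
for row in conn.execute(
    "SELECT product_id, COUNT(*) AS orders, SUM(qty) AS units "
    "FROM orders "
    "WHERE created >= date('now', '-90 days') "
    "GROUP BY product_id "
    "ORDER BY units DESC "
    "LIMIT 20"
):
    print(row)
```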


On mixing the two

Can a single system do both? Sure.

But when the OLTP part of it is a critical, live system, you never want to hear "oh, I'm just fetching all the data, that halts everything for a minute, don't worry, it'll continue afterwards" - and that is hard to guarantee never happens, due to the bulk nature of OLAP-style work.

As such, even if the software is the same, there is still a habit of siphoning off data from the database used for OLTP into a separate database used for OLAP - and possibly onto a separate host for the OLAP processing.
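
A rough sketch of that siphoning habit, continuing the made-up tables from above and assuming an ever-increasing `id` column to use as a bookmark: periodically copy new rows into a separate analytics database, so that bulk queries never touch the live OLTP host:

```python
import sqlite3

live = sqlite3.connect("shop.db")          # OLTP side: keep it responsive
analytics = sqlite3.connect("reports.db")  # OLAP side: bulk queries go here

def sync_orders() -> None:
    """Copy rows added since the last run; run this from cron or similar."""
    (last_id,) = analytics.execute(
        "SELECT COALESCE(MAX(id), 0) FROM orders"
    ).fetchone()
    rows = live.execute(
        "SELECT id, user_id, product_id, qty FROM orders WHERE id > ?",
        (last_id,),
    ).fetchall()
    with analytics:
        analytics.executemany(
            "INSERT INTO orders (id, user_id, product_id, qty) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
```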

There is now also Hybrid transactional/analytical processing (HTAP), the attempt to do both a transactional database and analytics in one system - which sometimes amounts to an automated variant of exactly that siphoning (or to giving those guarantees on the same system).


The more at-your-leisure your OLAP work is, the more that classic ETL-style processing covers your needs - but these days there are OLAP systems that aim to do real-ish time monitoring, and more explicitly balance scale and speed (also with NoSQL-ish scaling).

It's often still factors slower than OLTP, or at least a few seconds behind (e.g. when running off a replicated database server),
yet still many factors faster than generic warehousing - the "throw an ETL script at it once every so often, produce a report" style of things.


OLAP might be based on structured data from a central OLTP system you have, or it might ingest less-structured data (like logs), in which case terms like 'warehouse' or 'lake' turn up.



As a broad qualifier, OLTP and OLAP can describe

  • a system as a whole - goals, database, code, the hosts used
  • a type of database - e.g.
    • RDBMS is a fairly classical case of OLTP
    • most NoSQL (e.g. kv and document stores) is still largely on the OLTP side (if only because few do much processing themselves)
    • ElasticSearch used for log analysis or monitoring (see e.g. logstash, kibana) is more on the OLAP side
  • specific queries
    • e.g. some complex queries may be OLAP in nature
    • queries may get so complex that people do some pre-structuring (a sketch follows below)
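
That pre-structuring often amounts to materializing the expensive part ahead of time. A sketch, again with made-up names: precompute a daily summary (say, nightly) so that the interactive question becomes a cheap lookup:

```python
import sqlite3

conn = sqlite3.connect("shop.db")

# Run the expensive aggregation once, store the result in an
# assumed summary table...
with conn:
    conn.execute("DELETE FROM daily_sales")
    conn.execute(
        "INSERT INTO daily_sales (day, product_id, units) "
        "SELECT date(created), product_id, SUM(qty) "
        "FROM orders GROUP BY date(created), product_id"
    )

# ...so that later questions read the small summary, not the big table.
rows = conn.execute(
    "SELECT day, units FROM daily_sales WHERE product_id = ?", (42,)
).fetchall()
```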


It also has some applications to architecture. OLTP often implies tiered applications, whereas OLAP modeling is much more varied and often somewhat custom (though OLAP NoSQL does exist), ranging from a monolithic ETL batch script to complex interdependent streaming setups.

Lakes, warehouses, swamps, pits, and graveyards

Data warehouse means data in a processed, probably actively useful form.

This term has been around since the seventies.


Data lake means little more than "not a data warehouse"

It often means "we're tracking things, we'll see how it's useful later"

This frequently also means storing it in raw form, again because you don't know yet what you will want to extract from it. (...though if the thing is by nature already a structured document, this is more about whether it's indexed than whether it's parsed)


In some cases, data lakes involve so little thinking ahead that

  • you don't know what you have
  • you will have a really tough time figuring out how to use, combine, or even find this data later
  • you created what you could creatively call
    data swamps (inaccessible)
    data pits (won't produce clear insights)
    data graveyards (you're not using it, and you never will)


Notes:

  • searching for 'Data lake' will mostly give you IoT-related results - largely because IoT is very ill-defined beyond "collect all the data (personal or not) and see what we can do with it later".


  • Data lake sometimes also includes the software that accesses the data - anything from ETL/ELT to much fancier analytics.
    The more so if this is a company trying to sell you analytics software.


  • In practice, though,
    • data warehouse and data lake are often used interchangeably.
    • (so) since data warehousing can mean both, both ETL and ELT are seen in data warehousing


ETL, ELT?

ETL (Extract, Transform, Load)

is what data warehousing historically implies:
"I clean up my data before entering it into the system"
...in terms of data types and data validity, and changing the shape of the data to fit a model/schema you already settled on at design stage
...often so that it can be stored in a structured database, meaning
  • you have specific goals in mind
  • so you have already thought about efficient querying, and ensured faster loading
  • and you need to define that database up front
  • which means you can design it so that likely queries will be fast
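
A toy ETL run, assuming a CSV export and a target schema you settled on beforehand (all names illustrative): each record is parsed and validated before it is allowed into the structured store:

```python
import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse

def etl(path: str) -> None:
    with open(path, newline="") as f, conn:
        for rec in csv.DictReader(f):
            # Transform: enforce types and validity *before* loading,
            # so everything in the warehouse fits the agreed schema.
            when = datetime.fromisoformat(rec["timestamp"])
            amount = float(rec["amount"])
            if amount < 0:
                continue  # or route rejects somewhere for inspection
            conn.execute(
                "INSERT INTO sales (ts, amount) VALUES (?, ?)",
                (when.isoformat(), amount),
            )
```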



ELT (Extract, Load, Transform) mostly means

"just store what we got, in the form we got it in (probably a text blob, or even a binary blob), and worry about interpreting it later, maybe never"
Data lakes are arguably defined by this "it is not yet structured, queryable data" approach (contrasted with data warehouses, which are more ETL/usable)
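
The loading half of ELT can be as blunt as this sketch suggests (names assumed): keep the bytes exactly as received, plus just enough metadata to find them again later:

```python
import sqlite3
import time

lake = sqlite3.connect("lake.db")  # hypothetical data lake store

def ingest(source: str, payload: bytes) -> None:
    # Load only: no parsing, no validation, no schema beyond
    # "where it came from and when" - interpretation is deferred.
    with lake:
        lake.execute(
            "INSERT INTO raw_events (source, received, body) "
            "VALUES (?, ?, ?)",
            (source, time.time(), payload),
        )
```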

This store-first approach is in part pragmatism:

  • if there's data we don't use yet, and some we may never use, why spend time creating a model and polishing data before we can even store it - we can do that work when we need it, if we ever do
  • it doesn't need to have the form that the original design assumptions require
  • we can capture any and all data, and might be able to ask more varied questions and get more answers, because we kept the original data around.


However, the same thing that leads to these upsides (you didn't do anything yet) leads to downsides:

  • you will parse and validate it on use
    • ...either on every use
      this tends to be true for ad-hoc querying, which tends to mean mapreduce / pipeline solutions
      this is usually also slower and/or consumes more power
    • ...or on first use, e.g. if you validate-and-store into a working set for every question you work on
      meaning you store even more data, and take more time (sort of because you're creating a warehouse for each use)
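
To make that concrete, a sketch of both variants of the deferred transform, continuing the raw_events example above and assuming the blobs are JSON:

```python
import json
import sqlite3

lake = sqlite3.connect("lake.db")

# Parse-on-every-use: each ad-hoc question re-reads and re-parses the blobs.
def errors_per_source() -> dict:
    counts: dict = {}
    for source, body in lake.execute("SELECT source, body FROM raw_events"):
        event = json.loads(body)  # the deferred 'T' happens here, every time
        if event.get("level") == "error":
            counts[source] = counts.get(source, 0) + 1
    return counts

# Parse-on-first-use: validate once into a working set (a small warehouse
# per question), at the cost of storing the data a second time.
def materialize_errors() -> None:
    with lake:
        for source, body in lake.execute(
            "SELECT source, body FROM raw_events"
        ):
            event = json.loads(body)
            if event.get("level") == "error":
                lake.execute(
                    "INSERT INTO error_events (source, message) "
                    "VALUES (?, ?)",
                    (source, event.get("message")),
                )
```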


https://en.wikipedia.org/wiki/Data_warehouse

"NoSQL versus relational"

Lookups, joins, flatness, and locality

Responsibilities in database versus responsibilities in code

Why denormalize, why not denormalize?

ORMs considered good and bad

On consistency

The n+1 problem