Data structures

On the dictionary problem
The dictionary problem refers to what happens when you have a lot of records that you not only want to search, but also insert into and delete from as you go.

It points out the issue with a paper dictionary: inserting or removing one word basically means having to move everything after it. The same would be true for the most basic data structures in computing.

It's not that you can't (and it won't take as long as doing so in a paper dictionary), it's more that a subset of operations take an amount of work proportional to the amount of items you store, which makes it worse and worse for a data structure that may ever have to handle a lot of items.

A basic array means search, insert, and delete all take time proportional to the amount of elements.
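
As a small illustration (a sketch in Python, not part of the original notes), compare keeping words in a sorted list, where inserting means shifting everything after the insertion point, with a dict, where insert, lookup, and delete are amortized constant time:

import bisect

# Sorted list: bisect finds the spot in O(log n), but list.insert()
# still shifts every later element over, so the insert itself is O(n).
words = ["apple", "cherry", "melon"]
bisect.insort(words, "banana")         # shifts "cherry" and "melon" over
print(words)                           # ['apple', 'banana', 'cherry', 'melon']

# Dict (hash map): insert, lookup, and delete are amortized O(1),
# at the cost of extra memory and no ordering by key.
definitions = {"apple": "a fruit", "cherry": "also a fruit"}
definitions["banana"] = "a fruit too"  # no shifting of other entries
del definitions["cherry"]
print("banana" in definitions)         # True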





associative array, hash map, map, hash table, symbol table, dictionary - largely overlapping names for the same idea, a structure that maps keys to values

Bloom filter

"A Bloom filter is a space-efficient probabilistic data structure [...] used to test whether an element is a member of a set"[1], with occasional false positives (and quantifiable amount of them), but no false negatives.


A little more mechanically put, a Bloom filter is

  • a fixed-size structure (a balance between 'small, meaning not wasting space' and 'larger, to improve effectiveness')
  • with constant-time adds and checks,
  • where we add() all things we see,
  • where a get() of the same key, asking "have we added (seen) X before?", is answered either with
      • "quite possibly", or
      • "definitely not"


Why is that useful at all?

Consider that you have huge yet slow-to-access storage.

The most common operation is checking whether you have something stored.

It would be great if we had a structure in RAM that could answer "I definitely don't have this" or "I definitely have this" with perfect accuracy.

A hashmap would be one obvious implementation, but it would also necessarily grow with the number of things stored, so you won't be able to guarantee it stays in RAM.


It turns out that if you can accept "definitely not" and "possibly yes", you can make a much smaller, fixed-size data structure so you can ensure it sits in RAM. As the data in the backing store grows, it merely has to say "possibly" more often where before it could say "definitely not".
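
That pattern, sketched (using the BloomFilter class from the sketch above; slow_storage here is just a stand-in for whatever slow backing store you have):

slow_storage = {}            # stand-in; imagine lookups here are expensive
seen = BloomFilter()         # small, fixed-size, comfortably in RAM

def put(key, value):
    slow_storage[key] = value
    seen.add(key)

def get(key):
    if not seen.might_contain(key):
        return None          # "definitely not" - we skipped the slow lookup
    # "quite possibly" - we still have to check, and occasionally pay for
    # a slow lookup that turns out to be a miss (a false positive).
    return slow_storage.get(key)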



How?

While it's not precisely how they work, you can get an intuition for the properties of Bloom filters by modifying a hash map.

Consider

  • deciding how many buckets you have up front
  • for each bucket, storing just true or false (instead of a list of values, as a real hashmap would do)

Since you're still choosing the bucket by the hash of the key,

  • if you fetch a false from that, it means you definitely did not previously add that key.
  • if you fetch a true from that, it ends up meaning "the hash could have collided, so I can only answer 'possibly'".
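
That bucket-of-booleans variant, written out (a toy sketch; it amounts to a Bloom filter with a single hash function):

NUM_BUCKETS = 1024                    # decided up front
buckets = [False] * NUM_BUCKETS       # each bucket is just a flag, no values

def add(key):
    # Python's built-in hash() is consistent within one run, enough for a toy.
    buckets[hash(key) % NUM_BUCKETS] = True

def seen_before(key):
    if not buckets[hash(key) % NUM_BUCKETS]:
        return "definitely not"
    return "possibly"                 # another key may have hit the same bucket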


This also helps the intuition on why a smaller Bloom filter (in the hashmap analogy: fewer buckets) increases the chance of getting a 'possibly' rather than a certain 'nope'.

You probably want to make the Bloom filter as large as your RAM allows, to increase the number of "nope"s and decrease the number of false 'possibly's.

But for some applications, the Bloom filter can be quite small and still save a reasonable number of accesses, which is why this is useful even on microcontrollers.

You can quantify how large a Bloom filter you will want for a given set size and desired error rate - there are calculators for this.
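
The usual sizing math (these are the standard Bloom filter formulas, not something these notes spell out): for n expected items and an acceptable false-positive rate p, the number of bits is m = -n·ln(p) / (ln 2)² and the number of hash functions is k = (m/n)·ln 2. As a sketch:

import math

def bloom_parameters(n_items, false_positive_rate):
    # Standard sizing formulas for a Bloom filter.
    m_bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k_hashes = (m_bits / n_items) * math.log(2)
    return math.ceil(m_bits), round(k_hashes)

# e.g. a million items at a 1% false-positive rate:
print(bloom_parameters(1_000_000, 0.01))   # roughly (9585059, 7), i.e. about 1.2 MB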


Alternatives

Depending on RAM abundance, CPU power, and the case at hand - and even where a (much larger) Bloom filter would work well enough - you may be better served by

  • a bit array doing the same task but giving a fully correct answer
      more compact than a hash map, but with more practical restrictions:
      it has to fit in memory for your given, fixed-size set - or has to grow with the set (whereas a Bloom filter just becomes less efficient)
      it also requires you to have a simple, non-colliding mapping into that array
      (a small sketch follows after this list)
  • a hash map doing the same task but giving a fully correct answer
      always falling back to slow storage on misses, but counting on most things being hits
      it not only checks presence from RAM, but may often give the answer from RAM as well
      (and a Bloom filter in front of a memcache is often pointless)
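
For the bit-array option, a minimal sketch, assuming the keys map directly onto a bounded integer range (say, record IDs below a known maximum - that assumption is exactly the 'simple, non-colliding mapping' restriction mentioned above; the names MAX_ID, mark, and is_present are just illustrative):

MAX_ID = 10_000_000                      # the fixed, known key space
present = bytearray(MAX_ID // 8 + 1)     # one bit per possible ID, ~1.2 MB

def mark(record_id):
    present[record_id // 8] |= 1 << (record_id % 8)

def is_present(record_id):
    # Exact answer, no false positives - but only because the key space
    # is bounded and maps directly onto bit positions.
    return bool(present[record_id // 8] & (1 << (record_id % 8)))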


See also