==Bloom filter==

"A Bloom filter is a space-efficient probabilistic data structure [...] used to test whether an element is a member of a set"[1], with occasional false positives (in a quantifiable amount), but no false negatives.


A little more mechanically put, a bloom filter is
: a fixed-size structure {{comment|(a balance between 'small meaning not wasting space' and 'larger to improve effectiveness')}}
: with constant-time adds and checks,
: where we add() all things we see,
: where a get() of the same key, asking "have we added (seen) X before", is answered with either "quite possibly" or "definitely not"


'''Why is that useful at all?'''

Consider you have huge yet slow-to-access storage.
It would be great if we only accessed it when we need to.


If we had an index-like thing in RAM that could answer "I definitely don't have this" or "I definitely have this", that would be ''excellent'', but the size of that structure will grow with the amount of data in the store, so over time you won't be able to guarantee you can keep it in RAM.




It turns out that if you can accept the answers "definitely not" and "''possibly'' yes", you can make a ''much'' smaller (and fixed-size) data structure.
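As a minimal sketch of such a structure (illustrative only, not any particular library's implementation): a fixed-size bit array where each added key sets a handful of hash-chosen bit positions. The class name, parameters, and the SHA-256-based double hashing here are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; each added key sets a few hash-chosen bits."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from a single digest (double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely never added"; True only means "possibly",
        # because other keys could have set the same bits.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(num_bits=8192, num_hashes=5)
for word in ("alpha", "beta", "gamma"):
    bf.add(word)

print(bf.might_contain("alpha"))   # True - added keys always answer "possibly"
print(bf.might_contain("delta"))   # almost certainly False ("definitely not")
```

Note that the structure never stores the keys themselves, which is exactly why it stays fixed-size no matter how much you add.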


With fully correct answers, there is a size at which you have to say you ''can't''. A Bloom filter instead slowly degrades the quality of its answers.


So yes, as the actual store grows, our index becomes slowly less useful, having to say 'definitely not' less often and 'possibly' more often.


But as long as it can say 'definitely not' a good amount of the time, this still keeps being useful - in ways you can quantify when you know how much that difference costs you.
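To make that quantification concrete, a back-of-envelope sketch (both numbers below are made up purely for illustration):

```python
# Share of lookups that never need to touch the slow store, given a filter.
# Both numbers are assumptions, purely for illustration.
absent_fraction = 0.8       # lookups asking about keys NOT in the store
false_positive_rate = 0.01  # how often the filter says "possibly" anyway

# Absent keys answered "definitely not" skip the slow store entirely;
# present keys (and false positives) still have to go to the store.
avoided = absent_fraction * (1 - false_positive_rate)
print(f"{avoided:.1%} of all lookups skip the slow store")  # 79.2%
```

Multiply that fraction by the cost of one slow-store access and you have the savings the filter buys you.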


'''How does it work?'''


While it's not precisely how they work, you can intuit the properties of Bloom filters by modifying a hash map.
Consider
: deciding how many buckets you have up front
: for each bucket, storing just true or false {{comment|(instead of all values going to that bucket, as a real hashmap would do)}}


Since you're still deciding the bucket by hash of the key,
: it fills the allocated space pretty evenly
: if you fetch a ''false'' from that, it means you didn't previously set a key like that.
: if you fetch a ''true'' from that, it ends up meaning "the hash can always have collided, so I can only answer ''possibly''".
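The modified hashmap above, written out literally (a one-hash toy, whereas a real Bloom filter uses several hash functions; the bucket count and CRC32 hash are arbitrary choices for the example):

```python
import zlib

# Fixed number of buckets, each holding just True/False instead of values.
NUM_BUCKETS = 64
buckets = [False] * NUM_BUCKETS

def bucket_of(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_BUCKETS  # deterministic hash

def add(key: str) -> None:
    buckets[bucket_of(key)] = True

def check(key: str) -> str:
    # A False bucket proves the key was never added; a True bucket
    # could have been set by any colliding key, so it's only "possibly".
    return "possibly" if buckets[bucket_of(key)] else "definitely not"

add("alpha")
print(check("alpha"))  # "possibly" - an added key can never get "definitely not"
print(check("beta"))   # "definitely not", unless its bucket collides with alpha's
```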
This also helps the intuition on why having a smaller-sized bloom filter (in the hashmap analogy: fewer buckets) increases the possibility of getting a 'possibly' rather than a certain 'nope'.


You probably want to make the bloom filter as large as your RAM allows (...well, up to what the current/expected amount of stored items calls for), to increase the amount of "nope"s and decrease the amount of false 'possibly's.


But for some applications, the bloom filter can be ''quite'' small and still save a reasonable amount of accesses, which is why this is useful even on microcontrollers.


You can quantify how large a bloom filter you will want for a given set size and desired error rate - there are calculators for this.
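Those calculators use the standard sizing formulas, which are short enough to sketch directly (the function name is made up for the example):

```python
import math

def bloom_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Standard Bloom filter sizing:
    bits   m = -n * ln(p) / (ln 2)^2
    hashes k = (m / n) * ln 2
    """
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

bits, hashes = bloom_size(n_items=1_000_000, fp_rate=0.01)
print(bits, hashes)  # roughly 9.6 million bits (about 1.2 MB) and 7 hashes
```

Note how gentle that is: a million items at a 1% error rate fits in about a megabyte, regardless of how large the items themselves are.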




 
<!--
'''Alternatives'''

Depending on RAM abundance, CPU power, and the use case, even where a much larger bloom filter would work well enough, you may be better served by

* a [[bit array]] is one of those 'put all the answers in RAM, until you can't fit them' options
:: but: more compact than e.g. a hashmap, often fixed-size and not easy to alter on a changing store
::: has to fit in memory for your given, fixed-size set -- or ''has'' to grow with the set (a bloom filter just becomes less efficient)
::: also requires you to have a simple, non-colliding mapping into that array

* a [[hash map]] doing the same task but giving a fully correct answer

* a [[memcache]] (which is often an in-memory [[hashmap]])
:: not only checks presence from RAM, but can often give the answer from RAM as well
:: (and a bloom filter in front of a memcache is often pointless)
-->




See also
* [[Bloom embeddings]]

Revision as of 17:34, 27 March 2024
