Skip to main content

🌸 Bloom Filters

Probably yes or definitely no

The Nightclub Bouncer Analogy

A bouncer at a nightclub:

  • Checks if you're on the guest list
  • Sometimes says "you might be on the list" when you're not
  • Doesn't say "definitely not" if you actually are

Bloom filters work like this:

  • "Definitely NOT in the set" - accurate (in a standard bloom filter, assuming consistent hashing and that bits are only set, not cleared)
  • "Probably in the set" - might be wrong

False positives are possible. False negatives don't happen in the standard bloom filter under the usual assumptions above (but can happen if the filter is misused, corrupted, or you try to delete by clearing bits).


Why Use a Bloom Filter?

The Problem

You have a very large list of email addresses in a spam list.
For each incoming email, check if sender is spam.

Option 1: Store all emails in memory (huge!)
Option 2: Check database every time (slow!)
Option 3: Bloom filter sized for a low false-positive rate (fast, very memory efficient)

Bloom filters can use much less space than storing the actual elements, at the cost of occasional false positives.


How Bloom Filters Work

The Structure

A bit array of mm bits, all starting at 0:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 0  1  2  3  4  5  6  7  8  9

Plus kk hash functions that map elements to positions.

Adding an Element

Apply k hash functions, set those bits to 1:

Add "hello":
  hash1("hello") = 2
  hash2("hello") = 5
  hash3("hello") = 9

Set bits 2, 5, 9 to 1:
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
       ↑        ↑           ↑

Checking Membership

Apply same hash functions, check if ALL bits are 1:

Check "hello":
  hash1 = 2, hash2 = 5, hash3 = 9
  All are 1? Yes → "Probably in set"

Check "world":
  hash1 = 1, hash2 = 4, hash3 = 7
  Bit 1 is 0! → "Definitely NOT in set"

False Positives

Check "abc":
  hash1 = 2, hash2 = 5, hash3 = 9

Wait - those are the same positions as "hello"!
All bits are 1, but "abc" was not added.

False positive: "Probably in set" when it's not.

Visual Example

Empty bloom filter (a small bit array, a few hash functions):
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Add "cat" → hashes to 1, 4, 7:
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

Add "dog" → hashes to 2, 5, 8:
[0, 1, 1, 0, 1, 1, 0, 1, 1, 0]

Add "bird" → hashes to 0, 3, 6:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

Check "cat" → positions 1, 4, 7 all 1? Yes! ✓
Check "fish" → positions 2, 6, 9 → position 9 is 0! Not in set.
Check "ant" → positions 1, 3, 7 all 1? Yes, but ant was not added!
             False positive!

Mathematical Properties

False Positive Rate

Depends on:
  m = number of bits
  n = number of elements
  k = number of hash functions

There are standard formulas that relate $m$ (bits), $n$ (items), and $k$ (hash functions). Many libraries can compute good defaults for you.

Rule of thumb: more bits per element (and a reasonable $k$) generally means a lower false-positive rate.

Space Efficiency

Bloom filters can be dramatically smaller than exact structures like hash sets, but the exact savings depend on your target false-positive rate, overheads, and implementation.

Operations

OperationTimeDescription
AddO(k)Set k bits
CheckO(k)Check k bits
Remove❌Not supported!

Why No Removal?

"cat" uses bits 1, 4, 7
"dog" also uses bit 4

If we remove "cat" by clearing bits 1, 4, 7:
  Bit 4 affects "dog" too!
  Now "dog" looks like it's not in the set.

Solution: Use counting bloom filter (store counts, not bits)

Where Bloom Filters Are Used

1. Databases - Avoid Disk Reads

Before reading from disk:
  Check bloom filter: "Is key possibly on disk?"
  No → Skip disk read entirely!
  Yes → Read disk (might be false positive, but rare)

Used by: Cassandra, HBase, LevelDB

In general, bloom filters are common in storage engines and databases to avoid unnecessary disk reads.

2. Web Browsers - Malicious URL Checking

Browsers and security tools often use compact local data to quickly rule out most non-malicious URLs before doing more expensive checks.

3. Spell Checkers

Store dictionary in bloom filter.
"Is 'teh' a word?" → Filter says no → Probably misspelled

4. Network Routers

Check if packet was seen before (for deduplication).
Space too limited for exact tracking.
Bloom filter fits in cache memory!

5. Cryptocurrency

Bloom filters have been used by some lightweight clients to query relevant transactions without downloading an entire blockchain (newer protocols may use different approaches).

Bloom Filter vs Hash Set

PropertyBloom FilterHash Set
SpaceVery smallProportional to data
False positivesYes (controllable)No
False negativesNoNo
DeletionNot supportedSupported
Lookup timeO(k)O(1)

Note: Bloom filter membership checks are O(k)O(k), but kk is typically a small constant. Hash set lookups are usually O(1)O(1) on average (but can degrade in pathological cases).

Use Bloom Filter when:

  • Memory is constrained
  • False positives are acceptable
  • You only need "probably in" or "definitely not"

Counting Bloom Filter

Variant that supports deletion:

Instead of bits, store counts:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Add "cat" → increment positions 1, 4, 7:
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

Add "dog" → increment position 4:
[0, 1, 0, 0, 2, 0, 0, 1, 0, 0]
             ↑ count is 2 now

Remove "cat" → decrement positions 1, 4, 7:
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

"dog" still intact!

Tradeoff: Uses more space (integers instead of bits).


Common Mistakes

1. Trusting "Probably In"

Verify with the actual data source for critical operations.

2. Undersizing the Filter

Too few bits = high false positive rate = useless filter.

3. Wrong Number of Hash Functions

Too few: high false positives. Too many: filter fills up fast.

4. Expecting Deletion

Standard bloom filters can't delete. Plan accordingly!


FAQ

Q: When are false positives okay?

When the cost of a false positive is low (like an extra DB query), but the benefit of true negatives is high (skipping millions of queries).

Q: Can I reduce the false positive rate later?

Not without rebuilding (or layering another filter). Plan upfront!

Q: How do I choose m and k?

Use formulas based on expected n and desired false positive rate. Most libraries calculate this for you.

Q: Is there a data structure with no false positives?

Yes - a regular set! But it uses much more memory.


Summary

A bloom filter is a space-efficient probabilistic structure for set membership.

Key Takeaways:

  • "Definitely not" is designed to be correct in a standard Bloom filter under normal assumptions
  • "Probably in" might be wrong (false positive)
  • Uses bit array + multiple hash functions
  • Much smaller than storing actual elements
  • Can't delete elements (use counting variant)
  • Used in: databases, browsers, network routers

When space matters more than perfect accuracy, bloom filters shine!

Leave a Comment

Comments (0)

Be the first to comment on this concept.

Comments are approved automatically.