Bloom filters are good for search that does not scale

208 points by birdculture | 11/04/2025 | 41 comments

Comments

susam · 11/04/2025

When I worked at RSA over a decade ago, we developed Bloom filter-based indexing to speed up querying on a proprietary database that was specialised for storing petabytes of network events and packet data. I implemented the core Bloom filter-based indexer based on MurmurHash2 functions and I was quite proud of the work I did back then. The resulting improvement in query performance looked impressive to our customers. I remember the querying speed went up from roughly 49,000 records per second to roughly 1,490,000 records per second, so nearly a 30-fold increase.

However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.

With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007), and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 − e^(−kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
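The formula above is easy to check numerically (a minimal sketch; the function name is mine, the parameters are the ones stated in the comment):

```python
import math

def false_positive_rate(k: int, m: int, n: int) -> float:
    """Theoretical Bloom filter false-positive rate.

    After n insertions with k hash functions into m bits, a given bit
    is still 0 with probability ~e^(-k*n/m); a query for an absent key
    hits k set bits with probability (1 - e^(-k*n/m))^k.
    """
    return (1 - math.exp(-k * n / m)) ** k

print(f"{false_positive_rate(k=4, m=10007, n=1000):.4f}")  # ≈ 0.0118
```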

The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness; much like how CPU branch misprediction affects performance but not correctness.

A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2 minute wait for query results into a wait of just about 5 seconds, or in larger queries, reducing a 30 minute wait to about 1 minute.
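The block-skipping scheme can be sketched roughly like this. This is a toy illustration, not the RSA implementation: the names are mine, the structures are simplified, and SHA-256 stands in for MurmurHash2.

```python
import hashlib

M_BITS = 10007   # bits per filter (m)
K_HASHES = 4     # hash functions per key (k)

def _bit_positions(key: str):
    # Derive K_HASHES positions by salting the key with an index.
    for i in range(K_HASHES):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M_BITS

def build_filter(records) -> set:
    # One small filter per block of records.
    bits = set()
    for rec in records:
        bits.update(_bit_positions(rec))
    return bits

def might_contain(bits: set, key: str) -> bool:
    # False is definitive (no false negatives); True may be a false
    # positive, costing one unnecessary block read but never a wrong answer.
    return all(p in bits for p in _bit_positions(key))

def query(blocks, key):
    # blocks: list of (filter_bits, records) pairs; only blocks whose
    # filter passes are actually scanned.
    return [rec for bits, records in blocks
            if might_contain(bits, key)
            for rec in records if rec == key]
```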

MattPalmer1086 · 11/04/2025

May be true for offline full text search, but not true for online string search.

I invented a very fast string search algorithm based on Bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].

[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...

[2] https://sea2024.univie.ac.at/accepted-papers/

[3] https://github.com/nishihatapalmer/HashChain

pncnmnp · 11/04/2025

When my friends and I were undergrads (3rd year, I believe), we had an absolute blast exploring this exact topic: the intersection of Bloom filters and client-side searching. So much so that it became part of our undergrad thesis.

It all started when Stavros's blog was circulated on Hacker News! The way we approached the search part was by using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - which is based on a paper by Saar Cohen and Yossi Matias from the early 2000s; it's basically an iteration on counting Bloom filters. We used the minimal selection and minimal increase algorithms from the paper for insertion and ranking of results.
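The two heuristics from the Cohen-Matias paper can be sketched like this. This is my own rough illustration, not the thesis code; sizes and the hashing scheme are placeholders.

```python
import hashlib

M, K = 1024, 3  # counter-array size and number of hash functions (illustrative)

def _idx(term: str):
    # K counter indices per term, derived from salted SHA-256 digests.
    return [int.from_bytes(hashlib.sha256(f"{i}:{term}".encode()).digest()[:8],
                           "big") % M for i in range(K)]

class SpectralBloomFilter:
    """Counting Bloom filter variant that estimates term frequencies."""

    def __init__(self):
        self.counters = [0] * M

    def insert(self, term: str):
        idxs = _idx(term)
        low = min(self.counters[i] for i in idxs)
        # Minimal increase: only bump counters sitting at the minimum,
        # which limits overestimation caused by hash collisions.
        for i in idxs:
            if self.counters[i] == low:
                self.counters[i] += 1

    def estimate(self, term: str) -> int:
        # Minimal selection: the smallest of the K counters is the
        # best frequency estimate, usable for ranking results.
        return min(self.counters[i] for i in _idx(term))
```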

I wrote a blog on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html

Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf

adamzwasserman · 11/04/2025

The "no sharing between filters" insight clicked for me on a different problem.

I needed to filter items by tags. A Bloom filter per item seemed clever: quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.

Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.
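That layout can be sketched as follows. A hedged toy version: names, sizes, and the hashing scheme are mine, not the original code.

```python
import hashlib

M, K = 512, 3  # bits per chunk filter and hash functions (illustrative)

def _bits(tag: str):
    # K bit positions per tag, derived from salted SHA-256 digests.
    return {int.from_bytes(hashlib.sha256(f"{i}:{tag}".encode()).digest()[:8],
                           "big") % M for i in range(K)}

def build_chunks(tag_to_items: dict, chunk_size: int):
    # Shared inverted index (tag -> item list), split into chunks,
    # with one small Bloom filter per chunk for fast skipping.
    tags = sorted(tag_to_items)
    chunks = []
    for start in range(0, len(tags), chunk_size):
        part = {t: tag_to_items[t] for t in tags[start:start + chunk_size]}
        bloom = set().union(*(_bits(t) for t in part))
        chunks.append((bloom, part))
    return chunks

def lookup(chunks, tag: str):
    want = _bits(tag)
    for bloom, part in chunks:
        # Skip any chunk whose filter says the tag definitely isn't there.
        if want <= bloom and tag in part:
            return part[tag]
    return []
```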

TFA's mistake is using Bloom filters instead of an inverted index, rather than on top of one. The amortization patterns stack; they don't compete.

pauldix · 11/04/2025

I believe you could do this effectively with COBS (COmpact Bit Sliced signature index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...

It's a pretty neat algorithm from a 2019 paper, built for applications like indexing "k-mers of DNA samples or q-grams from text documents". You can take a collection of Bloom filters built for documents and combine them into a single index that tells you which docs a query maps to. Like an inverted index meets a Bloom filter.

I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).

There's also code online here: https://github.com/bingmann/cobs

cristaloleg · 11/05/2025

Curious why no one has mentioned XOR filters yet.

https://github.com/FastFilter

hijinks · 11/04/2025

Is there a better way than Bloom filters to handle needle-in-a-haystack searches, where the haystack might be terabytes of data and you only want a few lines?

taeric · 11/04/2025

I will forever think of Bloom filters as "bouncer filters." Could also go with "concierge filters." Basically, it is the equivalent of every movie where the detective asks the front desk about various attributes of whoever they are looking for.

It is not hard to see how you could start asking the front desk to track every obscure attribute, and how you should expect that to fall over for various reasons.

hinkley · 11/04/2025

> Fun fact: There is a nice implementation of this exact algorithm that is still used in the wild.

I thought that was going to be a link to Google.com

pi_22by7 · 11/04/2025

The key insight about Bloom filters lacking synergy is excellent. The ~7K-document crossover point makes sense because inverted indexes amortize dictionary storage across all documents, while Bloom filters must encode it linearly per document.

KevBurnsJr · 11/04/2025

Reminds me of @danthegoodman's project: bloomsearch [1]

[1] https://github.com/danthegoodman1/bloomsearch

sanskarix · 11/04/2025

The beautiful thing about Bloom filters is that they let you say "definitely not here" without checking everything. That asymmetry is weirdly powerful for specific problems.

I've seen them save startups real money in caching layers: checking "did we already process this event?" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
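The pattern looks roughly like this. A hedged sketch: the class and parameters are illustrative, and the authoritative database check is left as a stand-in.

```python
import hashlib

M, K = 8192, 4  # filter size in bits and hash functions (illustrative)

class SeenEvents:
    """Bloom filter gate in front of an expensive 'already processed?' lookup."""

    def __init__(self):
        self.bits = bytearray(M // 8)

    def _positions(self, event_id: str):
        for i in range(K):
            h = hashlib.sha256(f"{i}:{event_id}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % M

    def add(self, event_id: str):
        for p in self._positions(event_id):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_seen(self, event_id: str) -> bool:
        # False -> definitely new: skip the database entirely.
        # True  -> possibly seen: fall through to the real (authoritative)
        #          database check, so a false positive never corrupts anything.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(event_id))
```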

The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find the practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms: brilliant for 2% of problems, overkill for the rest.
