Clever. My first impression was that surely this saturates the filter too fast since we're setting more bits at once, but it looks like the maths checks out. It's one of those non-intuitive things I'm glad I learned today.
You can actually make those two bits more independent afaik.
https://github.com/apache/parquet-format/blob/master/BloomFi...
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
The first one is useful for grasping the idea; the second one is more comprehensive. Both do multiple bit loads but try to keep them as independent as possible, as far as I can understand (rough sketch after this comment).
Also, the hash function has a huge effect on Bloom filter performance. I was getting 2x perf when using xxhash3 instead of wyhash, even though wyhash is a faster hash function afaik.
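Something like this is the shape of the idea, as I understand it (a toy sketch, not the Parquet or RocksDB code; the word layout, names, and bit slices are my own assumptions): derive both bit positions from different slices of one hash so they stay roughly independent, and keep them in the same word so it's still a single load.

```cpp
#include <cstdint>
#include <vector>

// Toy sketch: set/test two bits in one 64-bit word, with the two bit
// offsets taken from separate slices of a single 64-bit hash.
struct TwoBitBloom {
    std::vector<uint64_t> words;
    explicit TwoBitBloom(size_t n_words) : words(n_words) {}

    // High bits pick the word; two disjoint 6-bit slices of the low
    // bits pick the two bit positions inside it.
    static uint64_t mask(uint64_t h) {
        return (1ULL << (h & 63)) | (1ULL << ((h >> 6) & 63));
    }

    void add(uint64_t h) {
        words[(h >> 32) % words.size()] |= mask(h);
    }
    bool maybe_contains(uint64_t h) const {
        uint64_t m = mask(h);
        return (words[(h >> 32) % words.size()] & m) == m;
    }
};
```

IIRC the Parquet split-block filter instead sets one bit in each of eight 32-bit lanes of a 256-bit block, which is another way to keep the probes independent while still touching a single cache line.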
Hmm, Bloom filters seem important. I'm wondering why my CS education never even touched on them and it's tbh triggering my imposter syndrome.
What are they running this code on?
I doubt their hardware is any faster shuffling bits in a uint32 than a uint64, and using uint64 should have a decent benefit to the false positive rate...
It's true that a fixed-size bloom filter lets the compiler generate better code...
But another approach is to use C++ templating so you can have say 10 different 'fixed size' implementations with no additional overhead, and at runtime select the most suitable size (rough sketch below).
For the couple of kilobytes of extra code size, this optimisation has to be worth it assuming table size is variable and there are some stats to give cardinality estimates...
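Roughly what I have in mind, as a sketch (all names made up, not from any particular library): the filter size is a template parameter so the masking is constant-folded, and a small factory picks among a few precompiled sizes at runtime.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Fixed-size Bloom filter: NumBits is a compile-time constant, so the
// index masking below compiles down to cheap constant arithmetic.
template <size_t NumBits>
struct FixedBloom {
    static_assert((NumBits & (NumBits - 1)) == 0, "power of two for cheap masking");
    std::array<uint64_t, NumBits / 64> words{};

    void add(uint64_t h) {
        size_t bit = h & (NumBits - 1);
        words[bit / 64] |= 1ULL << (bit % 64);
    }
    bool maybe_contains(uint64_t h) const {
        size_t bit = h & (NumBits - 1);
        return words[bit / 64] & (1ULL << (bit % 64));
    }
};

// Type-erased interface so callers don't care which size was picked.
struct AnyBloom {
    virtual ~AnyBloom() = default;
    virtual void add(uint64_t h) = 0;
    virtual bool maybe_contains(uint64_t h) const = 0;
};

template <size_t NumBits>
struct AnyBloomImpl final : AnyBloom, FixedBloom<NumBits> {
    void add(uint64_t h) override { FixedBloom<NumBits>::add(h); }
    bool maybe_contains(uint64_t h) const override {
        return FixedBloom<NumBits>::maybe_contains(h);
    }
};

// Runtime selection: pick the smallest precompiled size that gives
// roughly 10 bits per expected element (bits-per-key is an assumption).
std::unique_ptr<AnyBloom> make_bloom(size_t expected_elems) {
    size_t want = expected_elems * 10;
    if (want <= (1u << 16)) return std::make_unique<AnyBloomImpl<(1u << 16)>>();
    if (want <= (1u << 20)) return std::make_unique<AnyBloomImpl<(1u << 20)>>();
    return std::make_unique<AnyBloomImpl<(1u << 24)>>();
}
```

The virtual dispatch here costs one indirect call per lookup; you could instead template the caller's whole probe loop, but the point is the same either way: a handful of precompiled fixed sizes instead of one fully dynamic one.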
Isn't this idea similar to the classic "power of 2 random choices"?
https://www.eecs.harvard.edu/~michaelm/postscripts/handbook2...
This article is a little confusing. I think this is a roundabout way to invent the blocked bloom filter with k=2 bits inserted per element.
It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.
Overall, I think block bloom filters should be the default most people reach for. They completely solve the cache locality issues (single cache miss per element lookup), and they sacrifice only like a 10–15% space increase to do it. I had a simple implementation running at something like 20ns per query with maybe k=9. It would be about 9x that for naive Bloom filters.
There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices a single hash into multiple smaller hashes, that’s fine (rough sketch below).
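A rough sketch of that framing (my own toy code, not the article's implementation; the block size, remix function, and probe count are assumptions, with k=9 picked to match the numbers above): hash-0 selects one cache-line-sized block, and the remaining probes only ever touch that block.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Toy block Bloom filter: every element maps to one 64-byte block
// (8 x uint64_t), so a lookup touches a single cache line.
struct BlockBloom {
    static constexpr int kProbes = 9;                // bits set per element (assumption)
    std::vector<std::array<uint64_t, 8>> blocks;

    explicit BlockBloom(size_t n_blocks) : blocks(n_blocks) {}

    static uint64_t mix(uint64_t x) {                // splitmix64-style remix
        x += 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    void add(uint64_t h) {
        auto& b = blocks[h % blocks.size()];         // "hash-0": block index
        for (int i = 1; i <= kProbes; ++i) {
            uint32_t bit = mix(h + i) & 511;         // "hash-i": bit inside the 512-bit block
            b[bit >> 6] |= 1ULL << (bit & 63);
        }
    }

    bool maybe_contains(uint64_t h) const {
        const auto& b = blocks[h % blocks.size()];
        for (int i = 1; i <= kProbes; ++i) {
            uint32_t bit = mix(h + i) & 511;
            if (!(b[bit >> 6] & (1ULL << (bit & 63)))) return false;
        }
        return true;
    }
};
```

Each add/lookup touches exactly one 64-byte block, which is where the single-cache-miss behaviour comes from; the trade-off is that all of an element's bits land in one block, hence the small space penalty for the same false positive rate.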