I always find it interesting how often the simplest hash table layouts end up performing best in real workloads. Once you avoid pointer chasing and keep everything in a compact array, CPU caches do most of the heavy lifting.
It’s also a good reminder that clarity of layout often beats more “clever” designs, especially when the dataset fits comfortably in memory.
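To make that concrete, here's a minimal sketch of the two layouts (my own illustration, the names are made up): the chained version follows heap pointers on collisions, while the flat version keeps everything in one array so probes stay within a few cache lines.

    #include <stddef.h>
    #include <stdint.h>

    /* Chaining: each collision may chase a pointer to a cold cache line. */
    struct chain_node {
        uint64_t key;
        uint64_t value;
        struct chain_node *next;   /* pointer chase on every collision */
    };

    /* Open addressing: one contiguous array of slots, probed in place. */
    struct flat_entry {
        uint64_t key;
        uint64_t value;
        uint8_t  occupied;         /* 0 = empty slot */
    };

    struct flat_table {
        struct flat_entry *slots;  /* slots[0 .. capacity-1], kept compact */
        size_t capacity;           /* power of two for cheap masking */
    };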
I'm a big fan of the basic power-of-two-choices hash table design. It's simple to understand and implement, has reasonable constant factors, and reaches high load factors on real-world datasets.
You can use more elaborate probe and relocation schemes, but just choosing the less full bucket and resizing when both choices are full gets you surprisingly far.
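The insert path is roughly this, as a sketch; the bucket size, the hash mixers, and the "resize when full" signal are illustrative choices of mine, not anything from a particular implementation:

    #include <stddef.h>
    #include <stdint.h>

    #define BUCKET_SLOTS 8  /* entries per bucket; assumption, tune to a cache line */

    struct bucket {
        uint64_t keys[BUCKET_SLOTS];
        uint64_t vals[BUCKET_SLOTS];
        uint8_t  count;
    };

    struct two_choice_table {
        struct bucket *buckets;
        size_t nbuckets;          /* power of two */
    };

    /* Two independent hashes of the same key (standard 64-bit mixers). */
    static uint64_t hash1(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }
    static uint64_t hash2(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    /* Returns 0 on success, -1 if both candidate buckets are full
     * (the caller is expected to grow the table and retry). */
    int two_choice_insert(struct two_choice_table *t, uint64_t key, uint64_t val) {
        struct bucket *a = &t->buckets[hash1(key) & (t->nbuckets - 1)];
        struct bucket *b = &t->buckets[hash2(key) & (t->nbuckets - 1)];
        struct bucket *target = (a->count <= b->count) ? a : b;

        if (target->count == BUCKET_SLOTS)
            return -1;            /* the less full bucket is full, so both are */

        target->keys[target->count] = key;
        target->vals[target->count] = val;
        target->count++;
        return 0;
    }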
To me, these sorts of examples always seem contrived. To first order, I've never had a real hash table problem where the keys were machine words.
I've nearly always been hashing a variable-length string or some other complex structure, not its handle.
Back in my early career in C, this would be a generic API to hash and store void pointers, but the pointers themselves were not what was being hashed. The domain-specific hash function had to downcast and perform the appropriate remote memory access to fetch the variable-length material that was actually being hashed.
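In other words, something shaped like this (a rough sketch; the htab_* names, the record type, and the FNV-1a hash are mine, not the original code):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Generic table: stores void pointers, caller supplies hash/compare. */
    typedef uint64_t (*hash_fn)(const void *key);
    typedef int      (*eq_fn)(const void *a, const void *b);

    struct htab;  /* opaque */
    struct htab *htab_new(size_t initial_buckets, hash_fn hash, eq_fn eq);
    int          htab_put(struct htab *t, void *item);
    void        *htab_get(const struct htab *t, const void *key);

    /* Example domain type: identified by a variable-length name. */
    struct record {
        size_t name_len;
        char   name[64];    /* the material that actually gets hashed */
        int    payload;
    };

    /* The domain-specific callbacks downcast from void* and hash the bytes
     * behind the pointer, not the pointer value itself (FNV-1a here). */
    static uint64_t record_hash(const void *key) {
        const struct record *r = key;
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < r->name_len; i++) {
            h ^= (unsigned char)r->name[i];
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    static int record_eq(const void *a, const void *b) {
        const struct record *x = a, *y = b;
        return x->name_len == y->name_len &&
               memcmp(x->name, y->name, x->name_len) == 0;
    }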
Until you get heavy memory contention from the rest of the code. Once cache eviction gets high, you see some pretty counterintuitive improvements from fixing things that seem like they shouldn't need to be fixed.
My best-documented case was a 10x speedup from removing a double lookup that was thrashing the cache.
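The original code isn't shown here, but the general before/after pattern is something like this hypothetical sketch, where the fix is a single combined probe instead of a lookup followed by an insert:

    #include <stddef.h>

    struct counter_table;  /* some open-addressed table of counters (hypothetical) */

    /* Assumed API: find() and insert() each run a full probe sequence;
     * find_or_insert() probes once and returns the live slot. */
    long *counter_find(struct counter_table *t, const char *key);
    long *counter_insert(struct counter_table *t, const char *key);
    long *counter_find_or_insert(struct counter_table *t, const char *key);

    /* Before: two full probe sequences per key, touching the same
     * cache lines twice when the entry is missing. */
    void count_before(struct counter_table *t, const char *key) {
        long *slot = counter_find(t, key);
        if (slot == NULL)
            slot = counter_insert(t, key);   /* second probe */
        (*slot)++;
    }

    /* After: one probe that inserts a zeroed slot on a miss. */
    void count_after(struct counter_table *t, const char *key) {
        (*counter_find_or_insert(t, key))++;
    }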