I am dipping my toes in programming. But I can't follow this without graphical representations of the tables as the author (brilliantly) walks through it.
I don't understand the part about key removal. Is it correct to shift values to the left? Can't that break lookups for keys that were inserted at their natural positions?
I always find it interesting how often the simplest hash table layouts end up performing best in real workloads. Once you avoid pointer chasing and keep everything in a compact array, CPU caches do most of the heavy lifting.
It’s also a good reminder that clarity of layout often beats more “clever” designs, especially when the dataset fits comfortably in memory.
This is a smart implementation of Robin Hood hashing that I hadn't seen before. As I understand it, a standard implementation stores the probe length of each entry; this one avoids that thanks to its extra constraints. I don't quite understand the following strategy, though:
> To meet property (3) [if the key 0 is present, its value is not 0] ... "array index plus one" can be stored rather than "array index".
If the hash code can take any value in [0, 2^32), how do you define a special value for empty buckets? The more common solution is a special key, not a special hash code, for empty slots, which is easier to achieve. In addition, as the author points out, supporting generic keys requires storing 32-bit hash values. With the extra 4 bytes per bucket, it is not clear that this implementation beats plain linear probing (my favorite). The fastest hash table implementations, like boost's and abseil's, usually don't use Robin Hood hashing.
Awesome blog! Looking at the code I feel like there’s a kindred soul behind that keyboard, but there’s no About page afaict. Who beeth this mysterious writer?
> If the Zba extension is present, sh3add.uw is a single instruction for zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), and adding it to slots.
Yay, we've got an equivalent of the x86 SIB byte, but as three (six?) separate opcodes. Well, sub-opcodes.
It's a shame though that Zcmp extension didn't get into RVA23 even as an optional extension.
Is there a specific reason to store the key + value as a `uint64_t` instead of just using a struct like this?

struct slot {
    uint32_t key;
    uint32_t value;
};

This hash table is pretty good. It needs only one memory read when there's no collision. Randomized keys can still collide arbitrarily often, though.
Used Gemini to explain this. Very interesting, and I think Gemini did a good job explaining the Robin Hood fairness mechanism.
> The table occupies at most 32 GiB of memory.
This constraint allows making a linear array of all 4 billion values, with the key as the array index, which fits in 16 GiB. Another 512 MiB is enough for one presence bit per key.
Perhaps text strings as keys and values would give a more interesting example...