logoalt Hacker News

nkurzyesterday at 3:59 PM1 replyview on HN

It's trickier than that. Modern processors are speculative, which means that they guess at the result for a comparison and keep going along one side of a branch as far as they can until they are told they guessed wrong or hit some internal limit. If they guessed wrong, they throw away the speculative work, take a penalty of a handful of cycles, and do the same thing again from a different starting point.

Essentially, this means that all loops are already unrolled from the processors point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples) this means that the real game is how to get the processor to issue the requests for the data you will eventually need as far in advance as possible so you don't have to wait as long for it to arrive.

The difference in algorithm for quad search (or anything higher than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction) is that you prefetch all the possible cases but with less depth. This way you are guaranteed to have successfully issued the prefetch you will eventually need, and are spending slightly less of your bandwidth budget on data that will never be used in the actual execution path.

As others are pointing out, "number of comparisons" is almost useless metric when comparing search algorithms if your goal is predicting real world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.


Replies

drob518yesterday at 4:31 PM

Yea, I get that the actual comparison instruction itself is insignificant. It's everything that goes along with it. Seems like quaternary is fetching more data, however.

For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary, you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements, you'd just delegate to the SIMD instruction, but if this was a much larger array, you'd be doing more work.

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

show 1 reply