Hacker News

ssivark · yesterday at 5:45 PM · 9 replies

Daniel Lemire's points about low-level hardware optimization notwithstanding, it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic.

If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. E.g. a human searching a physical paper dictionary can zoom in on the right bunch of pages faster than pure idealized binary search; it's a separate matter that it's hard for humans to continue binary search to the very end, and we might default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc.).

In mathematical language, searching a sorted list is basically inverting a monotonic function using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.
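To make this concrete, here is a rough sketch in C (my own illustration, not from the article; search_monotonic and guess_fn are hypothetical names). The guess hook stands in for whatever prior you have about the data; with no prior it degenerates to plain bisection.

    #include <stddef.h>

    /* Searching a sorted array viewed as inverting the monotonic function
     * f(i) = a[i]: guess an index, measure the error, correct the guess.
     * With guess == NULL this is ordinary binary search. */
    typedef size_t (*guess_fn)(const int *a, size_t lo, size_t hi, int target);

    static long search_monotonic(const int *a, size_t n, int target, guess_fn guess)
    {
        if (n == 0) return -1;
        size_t lo = 0, hi = n - 1;
        while (lo <= hi) {
            size_t i = guess ? guess(a, lo, hi, target) : lo + (hi - lo) / 2;
            if (i < lo) i = lo;                  /* clamp a bad guess into range */
            if (i > hi) i = hi;
            if (a[i] == target) return (long)i;
            if (a[i] < target) {
                lo = i + 1;                      /* guess was too low: go right */
            } else {
                if (i == 0) break;               /* avoid unsigned underflow */
                hi = i - 1;                      /* guess was too high: go left */
            }
        }
        return -1;                               /* not found */
    }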

More generally, the best bet for solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representation. That can offer scalable orders-of-magnitude speedups, compared to the constant-factor speedups from just using hardware better.


Replies

crazygringo · today at 12:09 AM

Sure, but the whole point is that you often don't know anything further about the data.

That's why b-trees are the standard in databases. The data could be anything, and its characteristics could massively change at any time, as you suddenly import a whole bunch of new rows at once.

And while you can certainly design algorithms around e.g. gradient descent to try to accelerate lookup, b-trees are already incredibly fast, and have lots of other benefits like predictable worst-case performance and I/O requirements, supporting range scans, ordered traversal, prefix conditions, etc.

So yes, you can certainly design lookup algorithms that are more efficient for particular data distributions, but they will also often lack other important properties. And b-trees are already so fast that improvements are often negligible -- even if another algorithm produces a closer initial guess, it may be slower to locate the final item, or it may be faster on average but have horrible worst-case performance that makes it unusable.

Even with a paper dictionary, I've always used pretty much a binary search beyond the initial guess, which only saves you a couple of hops. And actually, once I get to the right handful of pages I'm probably more linear than I should be; I'd probably be faster if I did a rigorous binary search, but I have to balance that against how long it takes to flip pages.

charleslmunger · yesterday at 9:50 PM

I've spent some brainpower on binary search and have not been able to beat this:

https://github.com/protocolbuffers/protobuf/blob/44025909eb7...

1. Check for dense list O(1)
2. Check upper bound
3. Constant trip count binary search

The constant trip count is great for the branch predictor, and the core loop is pretty tightly optimized for the target hardware, avoiding multiplies. Every attempt to get more clever made the loop worse and did not pay for itself. It's hard because it's an array-of-structs format with a size of 12, and mostly pretty small N.
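For anyone wondering what "constant trip count" buys you, here is a generic sketch of the pattern (not the protobuf code linked above; lower_bound_u32 is just an illustrative name): the loop runs exactly ceil(log2(n)) times regardless of the key, so the only data-dependent decision is a conditional move of the base pointer rather than a hard-to-predict branch.

    #include <stddef.h>
    #include <stdint.h>

    static size_t lower_bound_u32(const uint32_t *a, size_t n, uint32_t key)
    {
        if (n == 0) return 0;
        const uint32_t *base = a;
        while (n > 1) {                          /* trip count depends only on n */
            size_t half = n >> 1;
            /* step over the lower half iff its last element is below the key;
             * compilers typically emit a cmov here, not a branch */
            base += (base[half - 1] < key) ? half : 0;
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }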

rixed · yesterday at 7:00 PM

> it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic

Also only if you do not learn anything about the data while performing the binary search, no? Like, if you are constantly landing below the estimate, you could guess that the distribution is biased toward large values and adjust your next guesses based on that prediction.

hinkley · yesterday at 6:44 PM

I swear I read an article about treaps, but instead of being used to balance the tree, the weights were used to Huffman-encode the search depth, to reduce the average access time for heterogeneous fetch frequencies.

I did not bookmark it and about twice a year I go searching for it again. Some say he’s still searching to this day.

painted-now · yesterday at 6:48 PM

> In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm.

Never thought about it this way. Brilliant!

mycall · yesterday at 6:02 PM

Furthermore, with the vast and immediate knowledge that LLMs have, we could see a proliferation of domain-specific sorting algorithms designed for all types of purposes.

tantalor · yesterday at 6:40 PM

> use that extra information to perform MUCH better

Do you mean using a better estimator for the median value? Or something else?

locknitpicker · yesterday at 6:37 PM

> If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better.

You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is likely to be, by interpolating between those elements.
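A rough sketch of the idea in C (my own illustration; real implementations usually cap the number of probes or fall back to binary search, since skewed data can make this degrade badly):

    #include <stddef.h>

    /* Probe where the key "should" be, assuming values are roughly uniform
     * over the current range, instead of always probing the midpoint.
     * Roughly O(log log n) comparisons on uniformly distributed data. */
    static long interpolation_search(const int *a, size_t n, int key)
    {
        if (n == 0) return -1;
        size_t lo = 0, hi = n - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[hi] == a[lo])                  /* flat range: avoid divide-by-zero */
                return (a[lo] == key) ? (long)lo : -1;
            /* linear estimate of the key's position within [lo, hi] */
            size_t pos = lo + (size_t)(((double)key - a[lo]) * (hi - lo)
                                       / ((double)a[hi] - a[lo]));
            if (a[pos] == key) return (long)pos;
            if (a[pos] < key) {
                lo = pos + 1;
            } else {
                if (pos == 0) break;             /* avoid unsigned underflow */
                hi = pos - 1;
            }
        }
        return -1;
    }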
