For those of us who operate on-site, we have to add network latency back in, which negates this win entirely and makes a proprietary cloud solution like this a nonstarter.
Are there vector DBs with 100B vectors in production that work well? There was a paper showing a 12% loss in accuracy at just 1 million vectors. Maybe some kind of logical sharding is another option, to improve both accuracy and speed.
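To make the logical-sharding idea concrete, here's a rough numpy sketch of one way it could look (names, shapes, and shard layout are all made up): exact search inside each shard, then a merge and re-rank of the per-shard winners.

    import numpy as np

    def shard_topk(shard_vecs, query, k):
        # exact L2 search inside a single shard
        dists = np.linalg.norm(shard_vecs - query, axis=1)
        idx = np.argpartition(dists, k - 1)[:k]
        return idx, dists[idx]

    def global_topk(shards, query, k):
        # query every shard, then merge and re-rank the shard winners
        cands = []
        for sid, vecs in enumerate(shards):
            kk = min(k, len(vecs))
            idx, d = shard_topk(vecs, query, kk)
            cands.extend((float(dist), sid, int(i)) for dist, i in zip(d, idx))
        cands.sort()  # global re-rank by distance
        return cands[:k]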
Fun!
I was curious given the cloud discussion - a quick search suggests default AWS SSD bandwidth is 250 MB/s, and you can pay more for 1 GB/s. Similar for S3: one HTTP connection is < 100 MB/s, and you can pay for more parallel connections. So the hot binary-quantized search index is doing a lot of work to minimize both of these, for the initial hot queries and for pruning later fetches. Very cool!
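Rough numbers on why the binary-quantized index matters so much at that scale (the 1024-dim embedding size here is just illustrative):

    import numpy as np

    DIM = 1024               # illustrative embedding dimension
    N = 100_000_000_000      # 100B vectors

    print(N * DIM * 4 / 1e12, "TB as float32")                    # ~409.6 TB
    print(N * DIM / 8 / 1e12, "TB binary-quantized (1 bit/dim)")  # ~12.8 TB

    # distance over binary codes is just XOR + popcount
    def hamming(a, b):
        # a, b: packed uint8 bit arrays (e.g. from np.packbits)
        return np.unpackbits(np.bitwise_xor(a, b)).sum()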
Out of curiosity, how is the 92% recall calculated? For a given query, is recall measured against the true top-k of all 100B vectors, or is it measured at each of N shards against the top-k of that respective shard?
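To spell out the first interpretation, the global version would be something like the sketch below; the per-shard version would just swap in each shard's own exact top-k and average over shards.

    def recall_at_k(approx_ids, exact_ids, k):
        # fraction of the exact (brute-force) top-k that the ANN search returned
        return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k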
This is at 92% recall. Could be worse, but could definitely be much better. Quantization and hierarchical clustering are tricks that lead to awesome performance at the cost of extremely variable quality, depending on the dataset.
The offline/local dev point is underrated. Being able to iterate without network latency or metered API costs makes a huge difference for prototyping. The challenge is making sure your local setup actually matches prod behavior. I've been burned by pgvector working fine locally then hitting performance cliffs at scale when the index doesn't fit in memory anymore.
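One cheap sanity check before relying on local behavior (a sketch assuming psycopg2 and a hypothetical index named items_embedding_idx - adjust to your schema): compare the index's on-disk size to what Postgres will actually cache.

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    cur = conn.cursor()

    # on-disk size of the vector index
    cur.execute("SELECT pg_relation_size('items_embedding_idx')")
    index_bytes = cur.fetchone()[0]

    # shared_buffers is reported in 8 kB blocks
    cur.execute("SELECT setting::bigint * 8192 FROM pg_settings "
                "WHERE name = 'shared_buffers'")
    shared_buffers_bytes = cur.fetchone()[0]

    if index_bytes > shared_buffers_bytes:
        print("index no longer fits in shared_buffers; expect the cliff")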
Using hierarchical clustering significantly reduces recall; it's an approach we tried and abandoned three years ago.
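Not claiming this matches that exact setup, but a single-level IVF-style toy in numpy/scikit-learn shows where the recall goes: true neighbours that land in cells you don't probe are simply never seen.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    xb = rng.normal(size=(20000, 64)).astype("float32")  # toy corpus
    xq = rng.normal(size=(100, 64)).astype("float32")    # toy queries
    k, nlist, nprobe = 10, 64, 2

    km = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(xb)
    lists = [np.where(km.labels_ == c)[0] for c in range(nlist)]

    def exact_topk(q):
        return np.argsort(np.linalg.norm(xb - q, axis=1))[:k]

    def clustered_topk(q):
        # only probe the nprobe nearest cells; anything in other cells is lost
        near = np.argsort(np.linalg.norm(km.cluster_centers_ - q, axis=1))[:nprobe]
        cand = np.concatenate([lists[c] for c in near])
        return cand[np.argsort(np.linalg.norm(xb[cand] - q, axis=1))[:k]]

    recall = np.mean([len(set(exact_topk(q)) & set(clustered_topk(q))) / k
                      for q in xq])
    print(f"recall@{k} with nprobe={nprobe}: {recall:.2f}")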
This is legitimately pretty impressive. I think the rule of thumb now is: go with Postgres (pgvector) for vector search until it breaks, then move to turbopuffer.