For those of us who operate on-site, we have to add network latency back in, which negates this win entirely and makes a proprietary cloud solution like this a nonstarter.
Are there vector DBs with 100B vectors in production that work well? There was a paper showing a 12% loss in accuracy at just 1 million vectors. Maybe some kind of logical sharding is another option, to improve both accuracy and speed.
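To make the logical-sharding idea concrete, here's a rough numpy sketch of one way it could look (names, shapes, and shard layout are all made up): exact search inside each shard, then a merge and re-rank of the per-shard winners.

    import numpy as np

    def shard_topk(shard_vecs, query, k):
        # exact L2 search inside a single shard
        dists = np.linalg.norm(shard_vecs - query, axis=1)
        idx = np.argpartition(dists, k - 1)[:k]
        return idx, dists[idx]

    def global_topk(shards, query, k):
        # query every shard, then merge and re-rank the shard winners
        cands = []
        for sid, vecs in enumerate(shards):
            kk = min(k, len(vecs))
            idx, d = shard_topk(vecs, query, kk)
            cands.extend((float(dist), sid, int(i)) for dist, i in zip(d, idx))
        cands.sort()  # global re-rank by distance
        return cands[:k]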
Fun!
I was curious given the cloud discussion - a quick search suggests default AWS SSD bandwidth is 250 MB/s, and you can pay more for 1 GB/s. Similar for S3: one HTTP connection is < 100 MB/s, and you can pay for more parallel connections. So the hot binary-quantized search index is doing a lot of work to minimize both of these, for the initial hot queries and for pruning later fetches. Very cool!
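Rough numbers on why the binary-quantized index matters so much at that scale (the 1024-dim embedding size here is just illustrative):

    import numpy as np

    DIM = 1024               # illustrative embedding dimension
    N = 100_000_000_000      # 100B vectors

    print(N * DIM * 4 / 1e12, "TB as float32")                    # ~409.6 TB
    print(N * DIM / 8 / 1e12, "TB binary-quantized (1 bit/dim)")  # ~12.8 TB

    # distance over binary codes is just XOR + popcount
    def hamming(a, b):
        # a, b: packed uint8 bit arrays (e.g. from np.packbits)
        return np.unpackbits(np.bitwise_xor(a, b)).sum()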
Out of curiosity, how is the 92% recall calculated? For a given query, is recall measured against the true top-k of all 100B vectors, or is it measured at each of N shards against the top-k of that respective shard?
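To spell out the first interpretation, the global version would be something like the sketch below; the per-shard version would just swap in each shard's own exact top-k and average over shards.

    def recall_at_k(approx_ids, exact_ids, k):
        # fraction of the exact (brute-force) top-k that the ANN search returned
        return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k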
This is at 92% recall. Could be worse, but could definitely be much better. Quantization and hierarchical clustering are tricks that lead to awesome performance at the cost of extremely variable quality, depending on the dataset.
The offline/local dev point is underrated. Being able to iterate without network latency or metered API costs makes a huge difference for prototyping. The challenge is making sure your local setup actually matches prod behavior. I've been burned by pgvector working fine locally then hitting performance cliffs at scale when the index doesn't fit in memory anymore.
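One cheap sanity check before relying on local behavior (a sketch assuming psycopg2 and a hypothetical index named items_embedding_idx - adjust to your schema): compare the index's on-disk size to what Postgres will actually cache.

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # placeholder DSN
    cur = conn.cursor()

    # on-disk size of the vector index
    cur.execute("SELECT pg_relation_size('items_embedding_idx')")
    index_bytes = cur.fetchone()[0]

    # shared_buffers is reported in 8 kB blocks
    cur.execute("SELECT setting::bigint * 8192 FROM pg_settings "
                "WHERE name = 'shared_buffers'")
    shared_buffers_bytes = cur.fetchone()[0]

    if index_bytes > shared_buffers_bytes:
        print("index no longer fits in shared_buffers; expect the cliff")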
Using hierarchical clustering significantly reduces recall; it's an approach we tried and abandoned three years ago.
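Not claiming this matches that exact setup, but a single-level IVF-style toy in numpy/scikit-learn shows where the recall goes: true neighbours that land in cells you don't probe are simply never seen.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    xb = rng.normal(size=(20000, 64)).astype("float32")  # toy corpus
    xq = rng.normal(size=(100, 64)).astype("float32")    # toy queries
    k, nlist, nprobe = 10, 64, 2

    km = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(xb)
    lists = [np.where(km.labels_ == c)[0] for c in range(nlist)]

    def exact_topk(q):
        return np.argsort(np.linalg.norm(xb - q, axis=1))[:k]

    def clustered_topk(q):
        # only probe the nprobe nearest cells; anything in other cells is lost
        near = np.argsort(np.linalg.norm(km.cluster_centers_ - q, axis=1))[:nprobe]
        cand = np.concatenate([lists[c] for c in near])
        return cand[np.argsort(np.linalg.norm(xb[cand] - q, axis=1))[:k]]

    recall = np.mean([len(set(exact_topk(q)) & set(clustered_topk(q))) / k
                      for q in xq])
    print(f"recall@{k} with nprobe={nprobe}: {recall:.2f}")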
This is legitimately pretty impressive. I think the rule of thumb now is: go with Postgres (pgvector) for vector search until it breaks, then move to turbopuffer.