Hacker News

antirez · today at 7:02 PM

This is very similar to what I stated here: https://x.com/antirez/status/2038241755674407005

That is, basically, you just rotate and use the 4-bit centroids given that the distribution is known, so you don't need min/max. Notably, once you have that, you can multiply using a 256-element lookup table when computing the dot product, since the two vectors share the same scale. The important point here is that for this use case it is NOT worth using the 1-bit residual: for the vector-x-quant dot product you have a fast path, but for quant-x-quant you don't, and anyway the recall difference is small. On top of that, remember that newer learned embeddings tend to use all the components fairly evenly, so you gain some recall for sure, but not as much as in the KV cache case.
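A minimal sketch of the lookup-table trick described above, assuming 16 fixed centroids chosen for a known (roughly Gaussian) component distribution; the centroid values, dimensions, and helper names here are illustrative, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed: 16 fixed centroids for the known distribution. Since both vectors
# are quantized against the same centroids, they share the same scale and one
# product table covers every pair of codes.
centroids = np.linspace(-1.5, 1.5, 16)

# 16 x 16 = 256-entry table: lut[a, b] = centroids[a] * centroids[b]
lut = np.outer(centroids, centroids)

def quantize(v):
    # Map each component to the index of its nearest centroid (a 4-bit code).
    return np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)

def quantized_dot(qa, qb):
    # quant-x-quant dot product: sum table lookups instead of multiplying.
    return lut[qa, qb].sum()

v1 = rng.standard_normal(128)
v2 = rng.standard_normal(128)
q1, q2 = quantize(v1), quantize(v2)

approx = quantized_dot(q1, q2)
exact = float(v1 @ v2)
```

The table lookup gives exactly the same result as dequantizing both vectors and multiplying, which is why no per-vector min/max calibration is needed.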


Replies

justsomeguy1996 · today at 7:30 PM

I think the main benefits are:

- Slightly improved recall

- Faster index creation

- Online addition of vectors without recalibrating the index

The last point in particular is a big infrastructure win, I think.
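The online-addition point can be sketched as follows: because the centroids are fixed by the known distribution rather than calibrated from the stored data, adding a vector is just quantize-and-append, and existing codes are never touched. The class and names below are hypothetical, just to illustrate the contrast with min/max-calibrated schemes:

```python
import numpy as np

# Assumed fixed, distribution-based centroids (no per-index calibration).
centroids = np.linspace(-1.5, 1.5, 16)

class FixedCentroidIndex:
    """Illustrative index: stores 4-bit codes against fixed centroids."""

    def __init__(self):
        self.codes = []  # one code array per stored vector

    def add(self, v):
        # Quantize against the fixed centroids; nothing already stored
        # needs to be re-quantized or rescaled.
        v = np.asarray(v, dtype=np.float64)
        codes = np.abs(v[:, None] - centroids[None, :]).argmin(axis=1)
        self.codes.append(codes.astype(np.uint8))

rng = np.random.default_rng(1)
idx = FixedCentroidIndex()
for _ in range(10):
    idx.add(rng.standard_normal(8))
# A min/max-calibrated scheme would have to rescale old codes whenever a new
# vector extended the observed range; here no recalibration ever happens.
```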