This is cool. It makes the KV cache much smaller to store, so more of it can be kept in fast memory.
Bandwidth-wise it is worse than the vanilla approach for both generation and random recall (more bytes accessed), and significantly worse than a quantized approach. That's because the reference needs to be read in addition to the compressed entry.
I guess the implied argument is that since the KV cache is smaller, the parts of it that are needed are more likely to already be in fast memory, so the bandwidth demand on slow links is reduced and performance goes up.
Would be interesting to see a discussion of the benefits and drawbacks of the approach, ideally backed by data.