Hacker News

TurboQuant: Redefining AI efficiency with extreme compression

330 points by ray__ today at 5:00 AM | 91 comments

Comments

amitport today at 7:47 AM

This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
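For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of rotating a vector, keeping one sign bit per coordinate, and rescaling on decode. It is an illustration under stated assumptions, not code from DRIVE, TurboQuant, or PolarQuant: the helper names are hypothetical, the rotation is a dense matrix from a QR decomposition (real systems often use faster structured rotations), and the scale ||x||^2 / ||Rx||_1 is one choice that makes the decoded vector's component along x exact.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields an orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fixing the column signs makes the result uniformly distributed.
    return q * np.sign(np.diag(r))

def one_bit_encode(x, rot):
    z = rot @ x
    signs = np.where(z >= 0, 1.0, -1.0)      # one bit per coordinate
    scale = np.dot(x, x) / np.abs(z).sum()   # assumed bias-correcting scale
    return signs, scale

def one_bit_decode(signs, scale, rot):
    # Undo the rotation; rot.T is the inverse of an orthogonal matrix.
    return scale * (rot.T @ signs)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)
rot = random_rotation(d, rng)
signs, scale = one_bit_encode(x, rot)
x_hat = one_bit_decode(signs, scale, rot)

print("||x||^2   :", np.dot(x, x))
print("<x_hat, x>:", np.dot(x_hat, x))
```

With this scale, the inner product of x_hat with x equals ||x||^2 for every rotation, so the remaining reconstruction error is orthogonal to x.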

gavinray today at 1:33 PM

Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"
I don't understand how taking a series of data and applying a random rotation could mathematically lead every time to "simpler" geometry.

If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right?
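One hedged reading of "simpler" is statistical rather than visual: a rotation preserves every length and distance, but after a random rotation the coordinates of any fixed vector look alike, so no single axis carries an outsized share of the energy. The sketch below assumes NumPy and uses two made-up vectors of equal norm; it illustrates that general property and is not TurboQuant's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# Orthogonal matrix from the QR decomposition of a Gaussian matrix.
rot, _ = np.linalg.qr(rng.standard_normal((d, d)))

spiky = np.zeros(d)
spiky[0] = 10.0                         # all energy in one coordinate
flat = np.full(d, 10.0 / np.sqrt(d))    # same norm, energy spread evenly

for name, v in [("spiky", spiky), ("flat", flat)]:
    z = rot @ v
    print(name)
    print("  norm after rotation :", np.linalg.norm(z))
    print("  max |coord| before  :", np.abs(v).max())
    print("  max |coord| after   :", np.abs(z).max())
```

The norm is unchanged in both cases, while the largest coordinate of the spiky vector shrinks sharply. A quantizer that spends the same few bits on every coordinate wastes fewer of them when the coordinates have similar ranges.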

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."
How can a boolean value preserve all of the relational and positional information between data points?
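A single boolean does not preserve that information; many of them together can approximate it. The sketch below shows the classical sign-random-projection estimator, in which the fraction of sign bits that differ between two vectors approximates their angle divided by pi. It assumes NumPy, arbitrary sizes, and made-up data, and it is a textbook stand-in rather than the exact estimator used in TurboQuant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096     # input dimension and number of sign bits (both arbitrary)

x = rng.standard_normal(d)
y = x + 0.5 * rng.standard_normal(d)    # a vector correlated with x

proj = rng.standard_normal((m, d))      # random Gaussian projection
bits_x = proj @ x >= 0                  # one boolean per projected coordinate
bits_y = proj @ y >= 0

cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
true_angle = np.arccos(cos_xy)
est_angle = np.pi * np.mean(bits_x != bits_y)   # differing fraction ~ angle / pi

print("true angle     :", true_angle)
print("estimated angle:", est_angle)
```

Sign bits discard vector lengths, so a scheme built on them has to store or estimate norms separately.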

akhenakh today at 12:46 PM

Someone is already implementing it in llama.cpp: https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...

Serhii-Set today at 3:02 PM

Compression research keeps producing surprisingly practical results. There is an interesting parallel in image formats: AVIF came out of video codec research (AV1) and JPEG XL out of the JPEG committee, and the compression gains translated almost directly. Makes me wonder how much of the current AI quantization work will eventually land in production inference the same way.

pstoll today at 11:48 AM

And a group has published an independent working implementation today, nice to see:

https://github.com/tonbistudio/turboquant-pytorch

benob today at 7:02 AM

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

mmastrac today at 1:44 PM

Is this a tradeoff between GPU computation expense and accuracy? I.e., you could quantize into segments or grids on the unit circle/sphere/etc., but that's too expensive, so it's better to just quantize to a Cartesian grid because the GPU can decompress it more cheaply?

iddan today at 1:19 PM

I am guessing that because Google is vertically integrated and "actually pays" for AI infra (compared to OpenAI and Anthropic, which receive hardware through partnerships), it has a more urgent incentive to reduce model sizes. Also, Google and Apple will be the first to gain from running models on-device.

bilsbie today at 12:37 PM

It seems like most breakthroughs I see are for efficiency? What are the most important breakthroughs from the past two or three years for intelligence?

bluequbit today at 6:42 AM

I did not understand what PolarQuant is.

Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

ssijak today at 11:32 AM

For my grug brain can somebody translate this to ELIgrug terms?

Does this mean I would be able to run a 500B model on my 48 GB MacBook without losing quality?
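For scale, here is a back-of-the-envelope calculation that separates weight memory from KV cache memory. It is plain arithmetic in Python; the function names and the model shape in `cfg` are hypothetical and chosen only for illustration, not taken from the paper or any specific model.

```python
def weight_bytes(n_params, bits_per_weight):
    # Memory needed to store the model weights alone.
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # Factor 2: one key and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

GB = 1e9
GIB = 2 ** 30

print("500B weights at 16 bits:", weight_bytes(500e9, 16) / GB, "GB")
print("500B weights at 4 bits :", weight_bytes(500e9, 4) / GB, "GB")

# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)
print("KV cache at 16 bits:", kv_cache_bytes(**cfg, bits_per_value=16) / GIB, "GiB")
print("KV cache at 3 bits :", kv_cache_bytes(**cfg, bits_per_value=3) / GIB, "GiB")
```

Under these assumptions the weights of a 500B-parameter model need about 250 GB even at 4 bits, so KV cache compression alone would not make it fit in 48 GB. What it shrinks is the per-token cache, which matters most at long context lengths.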

zeeshana07x today at 9:46 AM

The gap between how this is described in the paper vs the blog post is pretty wide. Would be nice to see more accessible writing from research teams; not everyone reading is an ML engineer.

macleginn today at 11:14 AM

"TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy" -- what does each group of 3 bits correspond to? Hardly individual keys or values, since that would limit each of them to 8 different vectors.

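On the 3-bit question above, the usual reading of a bit width in quantization work is bits per coordinate, not bits per vector. The sketch below assumes that reading, a key length of 128, and a plain uniform grid as a stand-in quantizer; none of these specifics come from the paper, and real schemes use tuned codebooks.

```python
import numpy as np

head_dim = 128        # assumed length of one key or value vector
bits = 3              # bits per coordinate
levels = 2 ** bits    # 8 levels per coordinate, not 8 vectors in total

print("bytes per vector at fp16  :", head_dim * 16 // 8)
print("bytes per vector at 3 bits:", head_dim * bits // 8)

rng = np.random.default_rng(0)
key = rng.standard_normal(head_dim)

# Stand-in quantizer: nearest point on a uniform 8-level grid over [-3, 3].
grid = np.linspace(-3.0, 3.0, levels)
codes = np.abs(key[:, None] - grid[None, :]).argmin(axis=1)   # one 3-bit code per coordinate
key_hat = grid[codes]

print("relative error:", np.linalg.norm(key - key_hat) / np.linalg.norm(key))
```

With one 3-bit code per coordinate, a 128-dimensional vector can take 8^128 distinct quantized values while occupying 48 bytes instead of 256.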
lwhi today at 12:42 PM

Will this help us run models locally?

maurelius2 today at 7:46 AM

I'm somewhat at a loss here other than understanding the fundamentals. Can someone tell me how the compression impacts performance?

moktonar today at 7:32 AM

Aren’t polar coordinates still n-1 angles plus 1 radius for an n-dimensional vector? If so, I understand that angles can be quantized better, but when the radius r is big, the error is large for highly quantized angles, right? What am I missing?
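The scaling in question can be checked numerically in two dimensions, where there is one angle and one radius. This is a minimal sketch with hypothetical helper names and an arbitrary 3-bit angle grid. It only illustrates the geometry and does not settle how PolarQuant handles the radius.

```python
import numpy as np

def quantize_angle(theta, bits):
    # Round theta to the nearest of 2**bits evenly spaced angles on the circle.
    step = 2 * np.pi / 2 ** bits
    return np.round(theta / step) * step

def polar_error(r, theta, bits):
    # Distance between the original point and the point with a quantized angle.
    theta_q = quantize_angle(theta, bits)
    p = r * np.array([np.cos(theta), np.sin(theta)])
    p_q = r * np.array([np.cos(theta_q), np.sin(theta_q)])
    return np.linalg.norm(p - p_q)

theta = 1.0
for r in (1.0, 10.0, 100.0):
    err = polar_error(r, theta, bits=3)
    print("r =", r, "absolute error =", err, "relative error =", err / r)
```

The absolute error is 2 r sin(|delta| / 2), where delta is the angle rounding error, so it grows linearly with r while the error relative to the vector's length stays fixed.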

_s_a_m_ today at 1:11 PM

has the word "advanced", gotta be good

naasking today at 1:47 PM

This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware.

[1] https://github.com/z-lab/paroquant

lucrbvi today at 8:50 AM

Sounds like Multi-Head Latent Attention (MLA) from DeepSeek

vaildegraff today at 11:58 AM

The accuracy preservation is impressive, but I'd want to see adversarial evaluation after quantization - not just benchmark scores. Compressed models can behave identically on clean inputs while diverging on edge cases. If your safety-critical behavior lives in the long tail of the distribution, a quantizer that rounds to the nearest centroid might round away your guardrails. Nobody publishes those numbers because nobody wants to find out.

mskkm today at 9:05 AM

Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that’s why the paper reports accuracy-vs-space but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. “several orders of magnitude” speedups for vector search? Really? Has anyone actually reproduced these results?
