minraws • today at 6:56 PM

Low-hanging? How low-hanging are we talking? The basic stuff is gone. The big challenges around quantization were largely solved two years ago, and we have just been improving from there.

But can massive gains still be made? Definitely.

The entire AI hype is based on the paper "Attention Is All You Need", and attention basically means keeping a huge matrix covering all the tokens in memory. How well you can optimize this attention layer is basically how most architectures are trying to improve performance and memory usage.
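To put a rough number on "a huge matrix of all the tokens", here is a minimal sketch of how the key/value cache of a standard multi-head attention model grows with context length. The function name and the model shape (32 layers, 32 heads, head size 128, 16-bit values) are illustrative assumptions, not figures for any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for plain multi-head attention.

    All parameters describe a hypothetical model; bytes_per_elem=2 assumes
    16-bit storage (fp16/bf16).
    """
    # Factor of 2: one cached tensor for keys, one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB
```

The cache grows linearly with both sequence length and batch size, which is why it is the usual target for memory optimizations.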

The only one with significant gains there is DeepSeek (or so I would like to believe, because others don't make their work open for folks like me outside the big AI labs to read). Their MLA architecture reduced KV-cache memory requirements by up to 90%. Of course, that's a purely architectural change.
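As a rough sketch of where a reduction of that size can come from: instead of caching full keys and values for every head, a latent-attention scheme caches one compressed vector per token plus a small positional component. The function names and all dimensions below are illustrative assumptions and not a claim about any released model's configuration.

```python
def mha_elems_per_token(n_heads, head_dim):
    # Standard attention caches a key and a value vector for every head.
    return 2 * n_heads * head_dim


def latent_elems_per_token(latent_dim, rope_dim):
    # Latent-style cache: one shared compressed vector per token, plus a
    # small decoupled positional key. Both sizes are hypothetical here.
    return latent_dim + rope_dim


mha = mha_elems_per_token(n_heads=32, head_dim=128)         # 8192 elements
mla = latent_elems_per_token(latent_dim=512, rope_dim=64)   # 576 elements
reduction = 1 - mla / mha
print(f"{reduction:.1%}")  # 93.0%
```

These counts are per token and per layer; the exact saving depends entirely on the dimensions chosen and on what baseline you compare against.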

With some quantization like TurboQuant from Google you could push it down to ~1/3 of that. A 90% reduction leaves 10% of the original, and a third of that is about 3%, so roughly 96-97% memory savings when talking about KV-cache.
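The arithmetic behind the combined figure, as a small sketch. The two inputs are the numbers quoted in this comment (a 90% architectural reduction and quantization to about one third); `combined_savings` is a hypothetical helper name.

```python
def combined_savings(arch_reduction, quant_ratio):
    """Total fraction of KV-cache memory saved when both effects stack.

    arch_reduction: fraction removed by the architecture change (0.90 = 90%).
    quant_ratio: fraction of the remaining cache left after quantization.
    """
    remaining = (1 - arch_reduction) * quant_ratio
    return 1 - remaining


savings = combined_savings(arch_reduction=0.90, quant_ratio=1 / 3)
print(f"{savings:.1%}")  # 96.7%
```

So the two savings multiply on the remaining memory rather than add, which is why the second step only moves the total from 90% to about 96.7%.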

But the models are close to saturated when it comes to quantization-based memory optimizations. We will need architectural changes to see a significant shift now.
