Hacker News

jychang · yesterday at 11:04 PM

The catch that you're missing is that Deepseek did this ages ago.

They're just using MLA (Multi-head Latent Attention), which is well known to cut KV cache size by roughly 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.
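If you want a rough back-of-the-envelope on why MLA shrinks the cache that much, here's a sketch. The dimensions below are illustrative assumptions loosely modeled on DeepSeek-V3's published config, not exact figures:

    # Rough KV-cache comparison: plain multi-head attention (MHA) vs MLA.
    n_layers = 61          # assumed depth
    n_heads = 128          # assumed attention heads
    head_dim = 128         # assumed per-head dimension
    kv_lora_rank = 512     # assumed MLA compressed-latent width
    rope_dim = 64          # assumed decoupled RoPE key width
    bytes_per_val = 2      # bf16

    # MHA caches full K and V for every head in every layer.
    mha_per_token = n_layers * 2 * n_heads * head_dim * bytes_per_val

    # MLA caches one compressed latent (plus a small RoPE key) per layer,
    # shared across heads; K/V get re-projected from it at attention time.
    mla_per_token = n_layers * (kv_lora_rank + rope_dim) * bytes_per_val

    print(f"MHA: {mha_per_token / 1024:.0f} KiB/token")
    print(f"MLA: {mla_per_token / 1024:.0f} KiB/token")
    print(f"reduction: {1 - mla_per_token / mha_per_token:.1%}")

With those numbers the reduction lands well past 90%; the exact figure depends on the baseline (MHA vs GQA) and the dtype.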

Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.
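For context on why linear attention helps at long context: softmax attention is O(n^2) in sequence length, while a linear-attention layer can carry a fixed-size running state updated once per token. A minimal sketch of plain kernelized linear attention (not any lab's exact variant):

    import numpy as np

    def softmax_attention(q, k, v):
        # O(n^2): every query scores every previous key (causal mask).
        n, d = q.shape
        scores = q @ k.T / np.sqrt(d)
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # O(n): a (d x d) state and a d-vector normalizer, updated per token.
        n, d = q.shape
        state = np.zeros((d, d))
        norm = np.zeros(d)
        out = np.empty_like(v)
        for t in range(n):
            state += np.outer(phi(k[t]), v[t])  # accumulate key-value outer products
            norm += phi(k[t])
            out[t] = (phi(q[t]) @ state) / (phi(q[t]) @ norm + 1e-6)
        return out

    q, k, v = (np.random.randn(8, 4) for _ in range(3))
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)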


Replies

storus · yesterday at 11:14 PM

Linear attention is really bad. It's only good for benchmaxing, but it loses valuable granularity, which you can feel in the latest DeepSeek randomly forgetting, ignoring, or "correcting" facts stated explicitly in the prompt.

erichocean · yesterday at 11:10 PM

Kimi K2 also uses MLA, and Kimi Linear runs Kimi Delta Attention (it's SSM-like) for three out of every four layers (the fourth uses MLA).
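For anyone unfamiliar with that layout: the layer schedule interleaves three linear-attention (KDA) layers with one full-attention (MLA) layer. A trivial sketch of the 3:1 pattern (names here are placeholders, not Kimi's actual code):

    def layer_schedule(n_layers: int, ratio: int = 4):
        """Attention type per layer: KDA everywhere except every `ratio`-th
        layer, which falls back to full MLA attention."""
        return ["MLA" if (i + 1) % ratio == 0 else "KDA" for i in range(n_layers)]

    print(layer_schedule(8))
    # ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']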
