Hacker News

saagarjha today at 10:54 AM

Sure, but any classical attention mechanism is quadratic in context length.


Replies

zozbot234 today at 12:10 PM

But text generation is only quadratic thanks to the KV cache optimization: each decode step is linear in context length, so the whole generation comes out quadratic. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
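
To put rough numbers on this, here is a minimal sketch that counts query-key score pairs over a whole generation, with and without a reusable KV cache. It assumes plain causal attention and assumes that "recomputing the cache" means rerunning attention over the entire prefix at every step. It ignores projections, MLP layers, and constant factors, and the function names are made up for illustration.

```python
def attention_pairs_with_cache(n_tokens: int) -> int:
    """Query-key pairs scored when decoding n_tokens with a KV cache.

    At step t, one new query attends to t keys (itself plus the cached ones),
    so each step is linear and the total is n(n+1)/2, i.e. quadratic.
    """
    return sum(t for t in range(1, n_tokens + 1))


def attention_pairs_without_cache(n_tokens: int) -> int:
    """Query-key pairs scored when the whole prefix is recomputed every step.

    At step t, causal attention over t tokens scores t(t+1)/2 pairs,
    so the total is n(n+1)(n+2)/6, i.e. cubic.
    """
    return sum(t * (t + 1) // 2 for t in range(1, n_tokens + 1))


if __name__ == "__main__":
    for n in (4, 1_000, 10_000):
        cached = attention_pairs_with_cache(n)
        uncached = attention_pairs_without_cache(n)
        print(f"n={n}: cached={cached}, recomputed={uncached}, "
              f"ratio={uncached / cached:.1f}")
```

Under these assumptions the ratio between the two totals is (n + 2) / 3, so the penalty for recomputing grows linearly with context length.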