Sure, but any classical attention mechanism is quadratic in context length.

saagarjha • today at 10:54 AM

Replies

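But with the KV cache optimization, text generation is quadratic overall: each decode step only attends over the cached keys and values, so a single step is linear in context length. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.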
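To put rough numbers on that, here is a minimal sketch that counts query-key score computations for generating `n` tokens with causal attention. The function names are hypothetical, and the model is deliberately simplified: one head, one layer, and nothing counted except attention scores. It only illustrates the scaling argument, not any particular implementation.

```python
def cached_decode_cost(n: int) -> int:
    """Score computations when earlier keys/values are cached.

    Step t computes one new query and scores it against t keys,
    so each step is linear and the total is n(n+1)/2.
    """
    return sum(t for t in range(1, n + 1))


def uncached_decode_cost(n: int) -> int:
    """Score computations when every step reprocesses the whole sequence.

    Step t recomputes causal attention for all t tokens, which costs
    t(t+1)/2 on its own, so the total grows cubically in n.
    """
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        cached = cached_decode_cost(n)
        uncached = uncached_decode_cost(n)
        print(f"n={n}: cached={cached}, uncached={uncached}, "
              f"ratio={uncached / cached:.1f}")
```

Under these assumptions the cached total is n(n+1)/2 and the uncached total is n(n+1)(n+2)/6, so the penalty for recomputing grows roughly as n/3.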

Replies