zozbot234 • today at 9:03 AM

The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant-factor savings by always recomputing only the earliest tokens, but it's not a good tradeoff as context sizes grow.

Replies

saagarjha • today at 10:54 AM

Sure, but any classical attention mechanism is quadratic in context length.
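
For a rough sense of the scaling both comments refer to, here is a minimal sketch that counts query-key score computations under causal attention. The function names are hypothetical, and it assumes only the attention scores matter, ignoring the projections and MLP layers, which grow linearly with context length. The "recompute only the earliest tokens" case is one possible reading of the parent comment, not a description of any specific system.

```python
def causal_attention_pairs(n: int) -> int:
    """Query-key pairs scored when processing n tokens with a causal mask.

    Token i (1-indexed) attends to tokens 1..i, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2, which grows as O(n^2).
    """
    return n * (n + 1) // 2


def prefix_recompute_pairs(n: int, fraction: float) -> int:
    """Pairs scored when recomputing only the earliest `fraction` of n tokens.

    Assumption for illustration: the recomputed prefix is processed on its
    own with a causal mask, so its cost depends only on the prefix length.
    """
    k = int(n * fraction)
    return causal_attention_pairs(k)


def decode_step_pairs(n_cached: int) -> int:
    """Pairs scored to decode one new token against an existing KV cache."""
    return n_cached + 1


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        full = causal_attention_pairs(n)
        half = prefix_recompute_pairs(n, 0.5)
        print(f"n={n}: full={full}, first half={half}, ratio={half / full:.3f}")
```

Doubling the context roughly quadruples the full count, and recomputing a fixed fraction of the context saves only a constant factor (about a quarter of the full cost for the first half), so the growth stays quadratic. Decoding a single token against an existing cache is linear in the cached length, which is the cost a cache avoids repeating.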
