Hacker News

zozbot234, today at 9:03 AM

The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.


Replies

saagarjha, today at 10:54 AM

Sure, but any classical attention mechanism is quadratic in context length.
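To put rough numbers on the quadratic scaling discussed in this thread, here is a minimal sketch that counts query-key dot products only. It assumes standard causal softmax attention, counts a single head in a single layer, and ignores the projection and MLP work, which grows linearly with context length. The function names are hypothetical and chosen for illustration.

```python
def prefill_score_count(n_tokens: int) -> int:
    """Query-key dot products needed to (re)build the KV cache for n_tokens.

    Under causal attention, token i attends to tokens 0..i, so the total
    is 1 + 2 + ... + n = n * (n + 1) / 2, which grows quadratically.
    """
    return sum(i + 1 for i in range(n_tokens))


def cached_decode_score_count(n_cached: int) -> int:
    """Dot products for one new token when n_cached keys are already cached.

    The new query is scored against every cached key plus its own key,
    so the per-token cost is linear in the context length.
    """
    return n_cached + 1


if __name__ == "__main__":
    # Doubling the context roughly quadruples the recompute cost,
    # while the cost of one cached decode step only doubles.
    for n in (1_000, 2_000, 4_000):
        print(n, prefill_score_count(n), cached_decode_score_count(n))
```

Under these assumptions, recomputing the cache from scratch pays the full quadratic sum again, whereas a decode step against an intact cache pays only the linear term.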
