Hacker News

zozbot234 · yesterday at 4:38 PM

You should be able to save a lot on prefill by stashing shared KV-cache prefixes (since the KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them back in as needed. Not sure why local AI engines don't do this already, since it's a natural extension of session save/restore and of what's usually called prompt caching.
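A minimal sketch of the idea, assuming nothing about any particular engine: key saved KV blocks by a hash of the token-id prefix, spill them to disk, and on a new request walk the prefix from longest to shortest to find the longest reusable block (so prefill only has to run on the uncached suffix). `PrefixKVStore` and `longest_cached_prefix` are hypothetical names, and NumPy arrays stand in for the real per-layer KV tensors.

```python
import hashlib
import os
import tempfile

import numpy as np

class PrefixKVStore:
    """Illustrative near-line store for KV-cache prefixes (not a real engine API)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _key(self, tokens):
        # Hash the token-id prefix; because the KV-cache is append-only,
        # any saved prefix of the prompt is reusable as-is.
        return hashlib.sha256(np.asarray(tokens, dtype=np.int64).tobytes()).hexdigest()

    def save(self, tokens, kv):
        # kv: stand-in array, e.g. shape [layers, 2 (K/V), seq_len, head_dim]
        np.save(os.path.join(self.root, self._key(tokens) + ".npy"), kv)

    def load(self, tokens):
        path = os.path.join(self.root, self._key(tokens) + ".npy")
        return np.load(path) if os.path.exists(path) else None

def longest_cached_prefix(store, tokens):
    # Walk prefixes from longest to shortest; return the reusable KV block
    # and its length, so prefill only needs to cover tokens[n:].
    for n in range(len(tokens), 0, -1):
        kv = store.load(tokens[:n])
        if kv is not None:
            return n, kv
    return 0, None

# Usage: a saved 3-token prefix is found when a longer prompt shares it.
store = PrefixKVStore(tempfile.mkdtemp())
store.save([1, 2, 3], np.zeros((2, 2, 3, 4)))
n, kv = longest_cached_prefix(store, [1, 2, 3, 4, 5])
```

A real implementation would hash per fixed-size block rather than per whole prefix (so lookup is linear in block count, not quadratic in tokens), but the longest-match walk above shows the core session-restore mechanic.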


Replies

FuckButtons · yesterday at 10:56 PM

If I understand you correctly, this is essentially what vLLM does with its paged KV-cache; apologies if I've misunderstood.
