Nice, although perhaps slightly academic given that good KV cache compression algorithms already exist. The frontier labs have probably been using them for a long time already. Nice to have it in llama.cpp, though.
I'm curious who "we" refers to. I can't see any authorship information or a paper, and this is the user's only repository. Maybe it doesn't need any. Also interesting that it was developed and tested on AMD hardware.
The main utility of this, beyond just saving money for model servers, would be deliberately prefilling very long contexts and then saving them to fast flash storage so you can quickly load and query them later. I think only Anthropic's API gives enough control to do this today, maybe Google's; OpenAI's makes caching fully implicit. Think one or two prompts per codebase, so you can then query the entire codebase in parallel with questions, without needing grepping or RAG. Modern serving pipelines all use disaggregated prefill as far as I know, so there are inter-machine transfers anyway, and it directly saves on GPU cost.
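For what it's worth, llama.cpp already exposes a rough version of this workflow via its prompt cache, which saves the evaluated KV state to disk so a later run can skip the prefill. A minimal sketch (the model and prompt file paths are placeholders, and flag behavior should be checked against your llama.cpp build):

```shell
# One-time prefill: evaluate the long context and write the KV state to disk.
./llama-cli -m model.gguf -f codebase_prompt.txt \
  --prompt-cache codebase.kv --prompt-cache-all -n 1

# Later queries: reload the cached KV state read-only; only the new question
# tokens need to be evaluated, so the long prefill is skipped.
./llama-cli -m model.gguf -f codebase_prompt_plus_question.txt \
  --prompt-cache codebase.kv --prompt-cache-ro -n 256
```

Caveat: as I understand it, the cached prompt has to be a literal prefix of the new prompt for the reuse to kick in, which is exactly the "one cache per codebase, many questions appended" pattern described above.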