
FuckButtons · yesterday at 10:56 PM

If I understand you correctly, this is essentially what vLLM does with their paged cache. If I've misunderstood, I apologize.


Replies

zozbot234 · yesterday at 11:08 PM

Paged Attention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961
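To make the "avoiding duplication of shared KV-cache prefixes" point concrete, here is a toy sketch of PagedAttention-style block management. All names (`BlockManager`, `Sequence`, etc.) are hypothetical, not the actual vLLM API: the KV cache is split into fixed-size blocks, each sequence maps logical positions to physical blocks through a block table, and a forked sequence shares its parent's prefix blocks via reference counts, copying a block only when it must write to a shared one (copy-on-write).

```python
# Toy sketch of paged KV-cache block management (hypothetical names,
# not the real vLLM API). Tracks only block ids, not the KV tensors.

BLOCK_SIZE = 4  # tokens per block (vLLM defaults to 16)

class BlockManager:
    """Owns the pool of physical blocks and their reference counts."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    """A generation stream; block_table maps logical -> physical blocks."""
    def __init__(self, mgr):
        self.mgr = mgr
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Crossed a block boundary: grab a fresh block.
            self.block_table.append(self.mgr.alloc())
        else:
            # Copy-on-write: if the last block is shared with a fork,
            # replace it with a private copy before writing into it.
            last = self.block_table[-1]
            if self.mgr.refcount[last] > 1:
                self.mgr.release(last)
                self.block_table[-1] = self.mgr.alloc()
        self.num_tokens += 1

    def fork(self):
        # Child reuses the parent's blocks; only refcounts change.
        child = Sequence(self.mgr)
        child.num_tokens = self.num_tokens
        child.block_table = list(self.block_table)
        for block in self.block_table:
            self.mgr.share(block)
        return child

mgr = BlockManager(num_blocks=8)
parent = Sequence(mgr)
for _ in range(6):          # fill one block, start a second
    parent.append_token()
child = parent.fork()       # zero-copy: prefix blocks are shared
parent.append_token()       # writes into a shared block -> COW copy
```

After the fork, both sequences point at identical blocks; the parent's next token triggers a copy of only the partially filled tail block, while the full prefix block stays shared. That per-block sharing is what saves memory in large-batch inference with common prompts.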