You can use the original model to compress the kv cache and get ∞x compression, since the prediction...

0-_-0 • today at 9:02 AM

You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it.

Replies

wongarsu • today at 11:04 AM

The tradeoff gets better the bigger your primary model is, and probably with bigger batch sizes too. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model.

For serving a 1T model with 16 concurrent requests, this could make a lot of sense. For an 8B model with a single request, far less so.
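
To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. It uses the standard KV cache size estimate (keys plus values, per layer, per KV head, per token). Every model shape below is a made-up illustration, not the configuration of any real model, and the 1B-parameter predictor, the 32k context length, and fp16 storage are all assumptions. It only compares VRAM and ignores the predictor's compute and activation memory.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Estimated KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def weight_bytes(params, bytes_per_param=2):
    """Memory needed to hold the model weights (fp16 by default)."""
    return params * bytes_per_param


def predictor_overhead(predictor_params, **kv_shape):
    """Predictor weight memory as a fraction of the KV cache it would compress."""
    return weight_bytes(predictor_params) / kv_cache_bytes(**kv_shape)


# Hypothetical shapes, chosen only for illustration.
small_single = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32768, batch=1)
large_batched = dict(layers=120, kv_heads=16, head_dim=128, seq_len=32768, batch=16)

PREDICTOR_PARAMS = 1_000_000_000  # assumed 1B-parameter predictor

for name, shape in [("8B-class, 1 request", small_single),
                    ("1T-class, 16 requests", large_batched)]:
    cache_gib = kv_cache_bytes(**shape) / 2**30
    ratio = predictor_overhead(PREDICTOR_PARAMS, **shape)
    print(f"{name}: KV cache ~{cache_gib:.0f} GiB, predictor/cache = {ratio:.3f}")
```

Under these assumed shapes, the predictor's weights are a large fraction of the small model's single-request cache but well under 1% of the large batched cache, which is the direction of the argument above. The real break-even depends on the achieved compression ratio and on the predictor's compute cost, neither of which this sketch models.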
