iidsample, yesterday at 7:24 PM

We at UT-Austin have done some academic work on the same challenge. I'd be curious whether serving engines could be modified to support it. Paper: https://arxiv.org/abs/2412.16434

The core idea is that we can use user activity at the client to manage KV cache loading and offloading. Happy to chat more!
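To make the idea concrete, here is a rough toy sketch (not the paper's actual design, and not any serving engine's API). It assumes the client reports two hypothetical events, "typing" and "idle", and that the server only tracks which tier holds each session's KV cache. The class name, event names, and the LRU fallback when the GPU is full are all assumptions for illustration.

```python
from collections import OrderedDict


class ActivityDrivenKVCache:
    """Toy model: records whether each session's KV cache sits on GPU or CPU.

    No tensors are moved here; a real engine would copy KV blocks
    where this sketch only updates bookkeeping.
    """

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots  # hypothetical capacity, in sessions
        self.gpu = OrderedDict()    # session_id -> None, least recently active first
        self.cpu = set()            # sessions offloaded to host memory

    def on_client_event(self, session_id, event):
        """Handle a (hypothetical) activity signal sent by the client."""
        if event == "typing":
            # The user is likely to send a request soon: load the cache early.
            self._prefetch(session_id)
        elif event == "idle":
            # The user stepped away: free GPU memory for other sessions.
            self._offload(session_id)
        else:
            raise ValueError(f"unknown event: {event!r}")

    def _prefetch(self, session_id):
        if session_id in self.gpu:
            self.gpu.move_to_end(session_id)  # mark as most recently active
            return
        if len(self.gpu) >= self.gpu_slots:
            # GPU is full: fall back to evicting the least recently active session.
            victim, _ = self.gpu.popitem(last=False)
            self.cpu.add(victim)
        self.cpu.discard(session_id)
        self.gpu[session_id] = None

    def _offload(self, session_id):
        if session_id in self.gpu:
            del self.gpu[session_id]
            self.cpu.add(session_id)

    def location(self, session_id):
        if session_id in self.gpu:
            return "gpu"
        return "cpu" if session_id in self.cpu else "absent"
```

Usage would look like calling `cache.on_client_event("alice", "typing")` when the client reports keystrokes, and `cache.on_client_event("alice", "idle")` when the tab loses focus or a timeout fires.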