Boris, wait, wait, wait,
Why not use a tiered cache?
Obviously storage is way cheaper than recalculating the embeddings from the very beginning of the session.
However you put this explanation, it still sounds strange. Hell, you could even store the cache on the client if you must.
Please tell me I'm misunderstanding what is going on...
otherwise you really need to hire someone to look at this!
I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge, which is why transferring them to/from the client is impractical. It would also let you figure out a lot about the underlying model, though I guess you could encrypt it.
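To back up "KV caches are huge", here is a back-of-envelope estimate. The dimensions below (80 layers, 8 KV heads with grouped-query attention, head dim 128, fp16) are assumptions in the style of a Llama-3-70B-class model, not the actual proprietary model being discussed:

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    one head_dim vector per KV head per cached token.
    All model dimensions are illustrative assumptions."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)          # bytes of cache per token
total = kv_cache_bytes(100_000)        # a 100k-token session
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.1f} GiB total")
# → 320 KiB/token, 30.5 GiB total
```

At ~30 GiB for a single long session, shipping the cache to a client over the network is clearly a non-starter, while parking it on flash is cheap.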
An interesting option would be to let the user pay more for longer caching, but if the base duration is one hour, I assume that would get expensive very quickly.
I don't think you can store the cache on the client, given that the thinking happens server-side and you only get summaries in your client (and even those are disabled by default).
Same question I had in https://news.ycombinator.com/item?id=47819914
I still don't understand it. Yes, it's a lot of data, and presumably they're already shunting it to CPU RAM instead of keeping it in precious VRAM, but they could go further and put it on SSD, at which point it's no longer in the hot path for their inference.
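The VRAM → RAM → SSD tiering the comments keep circling around can be sketched as a toy two-tier cache. This is a minimal illustration of the demotion/promotion idea, not anyone's actual serving stack; the class and a plain dict standing in for flash storage are my own invention:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: hot entries stay in 'RAM' (an LRU-ordered dict),
    cold entries are demoted to 'SSD' (a plain dict standing in for flash)."""

    def __init__(self, ram_capacity=2):
        self.ram = OrderedDict()      # key -> cache blob, LRU order
        self.ssd = {}                 # demoted blobs, cheap but slower
        self.ram_capacity = ram_capacity

    def put(self, key, blob):
        self.ram[key] = blob
        self.ram.move_to_end(key)     # mark as most recently used
        while len(self.ram) > self.ram_capacity:
            cold_key, cold_blob = self.ram.popitem(last=False)
            self.ssd[cold_key] = cold_blob   # demote LRU entry to flash

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.ssd:           # promote on hit: slower, but no recompute
            blob = self.ssd.pop(key)
            self.put(key, blob)
            return blob
        return None                   # true miss -> recompute from scratch
```

The point of the sketch is the last line: only a miss in *both* tiers forces re-prefilling the whole session, which is exactly the recomputation cost the thread argues flash storage should avoid.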