
londons_explore, today at 8:14 AM

The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.
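For a rough sense of scale, here is a back-of-the-envelope sketch of per-request KV cache size using the standard formula (keys plus values, per layer, per token). The model shape below is a hypothetical example, not any specific provider's model, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV cache size in bytes for a single request.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape and context length, chosen only for illustration.
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=100_000)
    print(f"{size / 1e9:.1f} GB per request")
```

Under these assumed numbers a single long-context request lands in the tens of gigabytes, which is why holding every request in RAM for hours looks implausible.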


Replies

dist-epoch, today at 9:31 AM

This is one reason why the price of SSDs has also doubled, not just the price of RAM.

> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.

https://cloud.google.com/blog/topics/developers-practitioner...
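To make the tiering idea in that quote concrete, here is a toy sketch of a multi-tier LRU cache where entries evicted from a fast tier are demoted to the next, slower one. This is not LMCache's API or implementation; `TieredCache` and its methods are hypothetical names, and the tiers are plain in-memory dicts standing in for HBM, CPU RAM, and SSD.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier LRU cache. Tier 0 is the fastest and smallest."""

    def __init__(self, capacities):
        # capacities: maximum number of entries per tier, fastest tier first.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value):
        # Remove any stale copy of the key, then insert into the fastest tier.
        for store in self.tiers:
            store.pop(key, None)
        self._insert(key, value, 0)

    def _insert(self, key, value, tier):
        if tier >= len(self.tiers):
            return  # Fell off the slowest tier: the entry is evicted for good.
        store = self.tiers[tier]
        store[key] = value
        store.move_to_end(key)  # Mark as most recently used in this tier.
        if len(store) > self.capacities[tier]:
            # Demote the least recently used entry to the next, slower tier.
            old_key, old_value = store.popitem(last=False)
            self._insert(old_key, old_value, tier + 1)

    def get(self, key):
        # Search fastest tier first; on a hit, promote the entry to tier 0.
        for store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(key, value, 0)
                return value
        return None  # Cache miss: the caller would have to recompute.


if __name__ == "__main__":
    cache = TieredCache([1, 2, 4])  # e.g. "HBM", "CPU RAM", "SSD" slots
    for i in range(5):
        cache.put(f"prefix-{i}", f"kv-{i}")
    print([list(t) for t in cache.tiers])
```

A real system would also weigh the transfer cost of pulling an entry back up from a slow tier against simply recomputing it, which this sketch ignores.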

choppaface, today at 8:54 AM

Or maybe they don’t actually cache (fully), but lie and just don’t charge the user right now. At least half the users, who are probably also the ones using the most similar tokens / prompts, wouldn’t really notice the difference in latency (or care).
