londons_explore • today at 8:14 AM • 2 replies

The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.

Replies

dist-epoch • today at 9:31 AM

This is one reason why the price of SSDs also doubled, not just the price of RAM.

> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.

https://cloud.google.com/blog/topics/developers-practitioner...
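
To make the tiering idea in that quote concrete, here is a minimal toy sketch of a multi-tier LRU cache. It is not LMCache's API or implementation: the `TieredCache` class, its methods, and the per-tier entry counts are all hypothetical, and the tiers are plain in-memory dicts standing in for HBM, CPU RAM, and SSD.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier LRU cache (hypothetical, for illustration only).

    Entries evicted from a faster tier spill into the next slower tier
    instead of being discarded right away.
    """

    def __init__(self, capacities):
        # capacities: max entries per tier, fastest tier first,
        # e.g. [hbm_slots, cpu_ram_slots, ssd_slots].
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used in this tier
        if len(tier) > self.capacities[level]:
            # Evict the least recently used entry and demote it one tier.
            old_key, old_value = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        # Drop any stale copy first so a key lives in exactly one tier.
        for tier in self.tiers:
            tier.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_index_where_found), or (None, -1) on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to the fastest tier
                return value, level
        return None, -1
```

With one slot per tier, inserting four entries pushes the oldest one out entirely, while the second-oldest is still recoverable from the slowest tier and is promoted back to the fastest tier when read. A real system would presumably size tiers in bytes rather than entries and pay very different latencies per tier, which this sketch ignores.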

choppaface • today at 8:54 AM

Or maybe they don't actually cache (fully), but lie and just don't charge the user right now. At least half the users, who are probably also the ones sending the most similar tokens / prompts, wouldn't really notice the difference in latency (or care).
