Hacker News

b65e8bee43c2ed0 today at 6:42 AM

the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:

>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.

262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
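To put a number on that exchange rate, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes decimal units (1 GB = 10^9 bytes), reads "262K" as 2^18 = 262,144 tokens, and treats each character as one byte; the function names are made up for illustration.

```python
# Figures taken from the quoted article and the comment above.
# Assumptions: decimal gigabytes, "262K" = 2**18 tokens, 1 byte per character.
KV_CACHE_BYTES = 56 * 10**9
CONTEXT_TOKENS = 2**18          # 262,144
CHARS_PER_TOKEN = 5             # rough estimate from the comment


def plaintext_bytes(tokens: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Approximate size of the raw text for a given token count."""
    return tokens * chars_per_token


def kv_bytes_per_token(kv_cache_bytes: int = KV_CACHE_BYTES,
                       tokens: int = CONTEXT_TOKENS) -> float:
    """Average KV cache footprint attributable to one token."""
    return kv_cache_bytes / tokens


def expansion_ratio(kv_cache_bytes: int = KV_CACHE_BYTES,
                    tokens: int = CONTEXT_TOKENS) -> float:
    """How many bytes of KV cache are held per byte of plaintext."""
    return kv_cache_bytes / plaintext_bytes(tokens)


if __name__ == "__main__":
    print(f"plaintext:          {plaintext_bytes(CONTEXT_TOKENS) / 1e6:.2f} MB")
    print(f"KV cache per token: {kv_bytes_per_token() / 1e3:.0f} KB")
    print(f"expansion ratio:    {expansion_ratio():,.0f}x")
```

Under those assumptions it works out to a bit over 200 KB of KV cache per token, or tens of thousands of bytes in memory per byte of text.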


Replies

londons_explore today at 8:14 AM

The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.
