the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:
>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.
262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
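to put a number on that exchange rate, here's a rough back-of-the-envelope sketch. the assumptions are mine, not the article's: "262K" is taken as 2**18 tokens, "56 GB" as decimal gigabytes, and every character as one byte.

```python
# Back-of-the-envelope: plaintext size vs. KV cache size at full context.
# Assumptions (not from the article): 262K means 2**18 tokens, GB means 10**9
# bytes, and each character is one byte (plain ASCII).

CONTEXT_TOKENS = 2**18            # 262,144 tokens
CHARS_PER_TOKEN = 5               # rough average used above
KV_CACHE_BYTES = 56 * 10**9       # "56 GB of KV cache" from the quote

# Size of the context if stored as plain text.
text_bytes = CONTEXT_TOKENS * CHARS_PER_TOKEN

# KV cache cost per token, and the blow-up relative to plain text.
kv_bytes_per_token = KV_CACHE_BYTES / CONTEXT_TOKENS
ratio = KV_CACHE_BYTES / text_bytes

print(f"plaintext: {text_bytes / 10**6:.2f} MB")
print(f"KV cache per token: {kv_bytes_per_token / 10**3:.0f} KB")
print(f"KV cache is about {ratio:,.0f}x the size of the text")
```

under those assumptions that's roughly 214 KB of cache per token, or a bit over 40,000 times the size of the text itself.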
The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100 GB of RAM per request for 12 hours.