It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5 MB.
The rough estimate is 2 * L * H_kv * D * bytes per element
Where:
* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values
The dominant factor here is typically 2 * H_kv * D, since it's usually at least 2048 bytes per token.
For Llama 3 8B you're looking at 128 GiB if your context is really 1M (not that that particular model supports a context so big). DeepSeek4 uses something called sparse attention, so the above calculation improves: 1M of context would use 5-10 GiB.
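For concreteness, here's that arithmetic as a minimal sketch. The model shape is an assumption on my part (roughly Llama-3-8B: 32 layers, 8 KV heads, head dim 128, fp16 cache), not read from any config:

```python
# Back-of-the-envelope KV-cache size from the formula above.
# Assumed shape: ~Llama-3-8B (32 layers, 8 KV heads, head dim 128),
# cached in fp16 (2 bytes per element).

def kv_cache_bytes(num_layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    # 2 * L * H_kv * D * bytes_per_element per token, times the context length.
    per_token = 2 * num_layers * kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens

size = kv_cache_bytes(num_layers=32, kv_heads=8, head_dim=128,
                      context_tokens=1_000_000)
print(f"{size / 2**30:.0f} GiB")  # ~122 GiB (exactly 128 GiB if "1M" means 2**20 tokens)
```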
But regardless of the details, you’re off by several orders of magnitude.
Pretty sure we're talking about the output text, not the tensors.