
zozbot234, yesterday at 2:41 PM

The memory capacity you need depends on the active parameters (not the same as the total with a MoE model) and on the context length, which determines the size of the KV cache. Even then, the KV cache can be pushed to system RAM and even further out to swap, since writes to it are small (just one KV vector per token).
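
For a rough sense of scale, here is a back-of-the-envelope sketch of the KV cache arithmetic. It assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per layer; the function names and the example config (32 layers, 8 KV heads, head dimension 128, fp16) are hypothetical and not taken from any particular model.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes appended to the KV cache for each new token.

    Assumes one key and one value vector per layer, each
    n_kv_heads * head_dim elements wide (hypothetical helper).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size for a full context of context_len tokens."""
    return context_len * kv_bytes_per_token(
        n_layers, n_kv_heads, head_dim, bytes_per_elem
    )


if __name__ == "__main__":
    # Hypothetical config, fp16 cache (2 bytes per element).
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    total = kv_cache_bytes(context_len=32768, n_layers=32, n_kv_heads=8,
                           head_dim=128)
    print(f"per-token write: {per_token / 1024:.0f} KiB")
    print(f"full 32k context: {total / 1024**3:.1f} GiB")
```

Under these assumptions the per-token write is on the order of a hundred KiB while the full cache runs to several GiB, which is the gap the comment is pointing at: the cache grows large with context, but each decoding step only appends a small amount to it.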