Hacker News

jychang (today at 12:13 PM)

Well, the actual inference providers put each expert on its own single GPU. Deepseek explicitly does this.

Read-only parameters are also usually the majority of the space. Deepseek is 700GB of params, while the kv cache is small (about 7GB at max context for Deepseek), and the ssm/conv1d cache is even smaller: IIRC Qwen 3.5 uses about 146MB per sequence regardless of context length. Not sure how Mamba-3 works, but I suspect read-only parameters still account for the bulk of the memory bandwidth.
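To make that concrete, here's a rough sketch using only the numbers quoted above (the node size is a made-up example, not a measured deployment):

```python
# Back-of-envelope only; the GB figures are the ones quoted in the comment,
# the node size below is a hypothetical example.
PARAMS_GB = 700        # Deepseek weights (read-only)
KV_CACHE_GB = 7        # per-user kv cache at max context
SSM_STATE_GB = 0.146   # fixed-size ssm/conv1d state (Qwen 3.5, per the comment)

def users_fitting(vram_gb: float) -> int:
    """How many full-context users fit once the weights are resident."""
    free = vram_gb - PARAMS_GB
    per_user = KV_CACHE_GB  # worst case: max-context kv cache per user
    return max(0, int(free // per_user))

# e.g. a hypothetical 8 x 141GB node = 1128 GB total
print(users_fitting(8 * 141))  # (1128 - 700) // 7 = 61 users
```

The point is just the ratio: the read-only 700GB dwarfs the few GB of per-user state, so memory capacity isn't the first thing you exhaust.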

I guess the question isn't whether compute is 1:1 with memory, but whether you run out of compute before you run out of VRAM as you add more users.
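That tradeoff has a rough roofline form: each decode step streams the active weights once regardless of batch size, while compute grows with batch size, so there's a crossover batch beyond which you're compute-bound. A sketch with assumed hardware numbers (not any specific GPU):

```python
# Rough roofline sketch; both hardware figures are assumptions, not specs.
MEM_BW = 3.35e12        # bytes/s of HBM bandwidth (assumption)
PEAK_FLOPS = 1.0e15     # FLOP/s (assumption)
BYTES_PER_PARAM = 1     # e.g. fp8 weights (assumption)

def crossover_batch() -> float:
    # t_mem  = n_params * BYTES_PER_PARAM / MEM_BW    (per step, any batch)
    # t_comp = batch * 2 * n_params / PEAK_FLOPS      (~2 FLOPs/param/token)
    # Setting t_mem == t_comp, n_params cancels out:
    return PEAK_FLOPS * BYTES_PER_PARAM / (2 * MEM_BW)

print(round(crossover_batch()))  # ~149 with these assumed numbers
```

Below that batch size you're wasting compute; above it, adding users costs compute rather than bandwidth, which is the "run out of compute first" scenario.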


Replies

zozbot234 (today at 2:30 PM)

> Well, the actual inference providers put each expert on its own single GPU.

Experts are usually chosen on a per-layer basis, not just per token, so I'd think this requires lots of GPUs to make it worthwhile. You could do it with a single physical GPU by switching expert-layer mixes in round-robin fashion once the batch for any single expert-layer mix is completed (essentially a refined version of expert offloading). But still, not easy.
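A minimal sketch of that round-robin idea (all names here are illustrative, not from any real serving stack): queue tokens per (layer, expert) pair, and drain one pair's whole batch before swapping in the next, so the expensive weight swap is amortized over many tokens.

```python
from collections import defaultdict, deque

class RoundRobinExpertScheduler:
    """Toy single-GPU scheduler: one (layer, expert) pair resident at a time."""

    def __init__(self):
        self.queues = defaultdict(deque)   # (layer, expert) -> pending tokens
        self.order = deque()               # round-robin order of active pairs

    def submit(self, layer: int, expert: int, token) -> None:
        key = (layer, expert)
        if not self.queues[key]:
            self.order.append(key)         # first pending token for this pair
        self.queues[key].append(token)

    def next_batch(self):
        """Pick the next pair round-robin and drain its entire batch,
        amortizing the expert-weight swap across all queued tokens."""
        if not self.order:
            return None
        key = self.order.popleft()
        return key, list(self.queues.pop(key))

sched = RoundRobinExpertScheduler()
sched.submit(0, 3, "t1"); sched.submit(0, 3, "t2"); sched.submit(1, 7, "t3")
print(sched.next_batch())  # ((0, 3), ['t1', 't2'])
```

The hard part this glosses over is that a token needs layer N's expert output before it can be routed at layer N+1, so the queues fill in waves rather than all at once.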