zozbot234 • today at 9:56 AM

> Everyone groups all the requests into a batch, and the GPU computes them together.

You're only saving on fetching read-only parameters, and not even on that if you're using MoE models, where each request in the batch might require a different expert (unless you rearrange batches so that sharing experts becomes more likely, but that's difficult since expert choice changes per token and even per layer). Everything else - KV cache, activations - gets multiplied by your batch size. You scale both compute and memory pressure by largely the same amount. Yes, GPUs are great at hiding memory fetch latency, but that also applies to n=1 inference.
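To make that accounting concrete, here is a rough sketch of memory traffic for one decode step of a dense model. The function name and all byte counts are made up for illustration, and it ignores activations and MoE routing entirely:

```python
def decode_step_traffic(param_bytes: float, kv_bytes_per_seq: float, batch: int) -> dict:
    """Bytes read for one decode step under a simplified dense-model assumption."""
    shared = param_bytes                # weights are fetched once for the whole batch
    per_seq = kv_bytes_per_seq * batch  # every sequence reads its own KV cache
    total = shared + per_seq
    return {"total": total, "per_request": total / batch}


if __name__ == "__main__":
    # Hypothetical sizes: 100 GB of weights, 2 GB of KV cache per sequence.
    for b in (1, 8, 64):
        t = decode_step_traffic(param_bytes=100e9, kv_bytes_per_seq=2e9, batch=b)
        print(f"batch={b:3d} total={t['total'] / 1e9:7.1f} GB "
              f"per_request={t['per_request'] / 1e9:6.2f} GB")
```

Only the `shared` term gets amortized as the batch grows; the `per_seq` term scales linearly with it, so how much batching helps depends on the ratio between the two.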

Replies

jychang • today at 12:13 PM

Well, the actual inference providers put each expert on its own GPU. DeepSeek explicitly does this.

Read-only parameters are also usually the majority of the memory footprint. DeepSeek is 700GB of params. Meanwhile the KV cache is small (DeepSeek's is about 7GB at max context), and the SSM/conv1d cache is even smaller: IIRC Qwen 3.5 is 146MB per token regardless of context size. Not sure how Mamba-3 works, but I suspect read-only parameters still account for a significant share of memory bandwidth.

I guess the question isn't whether compute is 1:1 with memory, but rather whether you run out of compute before you run out of VRAM as you add more users.
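A back-of-the-envelope way to frame that question, as a sketch only: the 700GB and 7GB figures are the ones quoted above, while the total VRAM, the compute numbers, and all function names are hypothetical placeholders.

```python
def max_users_by_vram(total_vram_gb: float, param_gb: float, kv_gb_per_user: float) -> int:
    """Users that fit once the weights are resident, if each holds a full KV cache."""
    free = total_vram_gb - param_gb
    if free <= 0:
        return 0
    return int(free // kv_gb_per_user)


def max_users_by_compute(peak_flops: float, flops_per_token: float,
                         tokens_per_sec_per_user: float) -> int:
    """Users the hardware can serve at a target decode speed, ignoring memory."""
    return int(peak_flops // (flops_per_token * tokens_per_sec_per_user))


def limiting_resource(vram_users: int, compute_users: int) -> str:
    """Name the resource that runs out first as users are added."""
    return "vram" if vram_users < compute_users else "compute"


if __name__ == "__main__":
    # 700 GB params and 7 GB KV per user come from the comment above;
    # the 1400 GB cluster and the compute figures are assumed for illustration.
    by_vram = max_users_by_vram(total_vram_gb=1400, param_gb=700, kv_gb_per_user=7)
    by_compute = max_users_by_compute(peak_flops=1e15, flops_per_token=1e11,
                                      tokens_per_sec_per_user=20)
    print(by_vram, by_compute, limiting_resource(by_vram, by_compute))
```

Whichever count is smaller is the ceiling. Most users sit well below max context, so the VRAM bound here is pessimistic.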
