I'm not sure that I buy their conclusion that more compute during inference is good.
Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.
With a fused kernel, that means the GPU streams the tensors from VRAM and does a bunch of compute on the different conversations in the batch at the same time.
If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers don't normally leave GPU cores idle during inference.
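A back-of-envelope roofline sketch of the argument above. The hardware numbers are made up for illustration, and the model only counts weight traffic (it ignores KV-cache and activations), so treat it as a rough picture rather than a measurement.

```python
def crossover_batch(peak_flops, mem_bw, bytes_per_param, flops_per_param_per_token):
    """Batch size at which the math takes as long as streaming the weights.

    Simplified model (an assumption, not a benchmark):
      time_mem     = params * bytes_per_param / mem_bw
      time_compute = batch * params * flops_per_param_per_token / peak_flops
    Setting them equal and solving for batch; params cancels out.
    """
    return (bytes_per_param / mem_bw) * (peak_flops / flops_per_param_per_token)


# Hypothetical accelerator: 1 PFLOP/s peak, 2 TB/s memory bandwidth, fp16 weights.
PEAK_FLOPS = 1e15
MEM_BW = 2e12
BYTES_PER_PARAM = 2

# Roughly 2 FLOPs per parameter per token (one multiply, one add).
base = crossover_batch(PEAK_FLOPS, MEM_BW, BYTES_PER_PARAM, 2)
# Same weights, but twice the compute per token.
heavier = crossover_batch(PEAK_FLOPS, MEM_BW, BYTES_PER_PARAM, 4)

print(f"compute-bound above batch ~{base:.0f}")
print(f"with 2x compute per token: ~{heavier:.0f}")
```

Under these assumptions, doubling the compute per token halves the batch size at which the GPU stops being memory-bound. Whether a given provider actually runs near that crossover is a separate question.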
Focusing on the needs of providers isn't a very good long-term strategy if you believe compute will eventually move to self-hosted and on-premises solutions, where large batch sizes aren't needed.
Throughput is indeed king for the standard-tier mindshare-capture play. But there are many who would pay multiple times the current cost for agentic systems for engineers and executives, if it meant a meaningful reduction in latency. The economics could work extremely well.
Local has a batch size of 1. If you are already memory-bound, then you leave compute on the table. Why not use it?
Not sure they target local though…
Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.
> Everyone groups all the requests into a batch, and the GPU computes them together.
You're only saving on fetching read-only parameters, and not even on that if you're using MoE models, where each inference in the batch might require a different expert (unless you rearrange batches so that sharing experts becomes more likely, but that's difficult since experts change per-token or even per-layer). Everything else - KV-cache, activations - gets multiplied by your batch size. You scale both compute and memory pressure by largely the same amount. Yes, GPUs are great at hiding memory fetch latency, but that also applies to n=1 inference.
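To put rough numbers on what batching does and doesn't amortize, here is a minimal sketch. It assumes a dense model (no MoE routing) where the weights are read once per decode step and each sequence reads its own KV-cache. The byte counts are hypothetical.

```python
def bytes_per_token(weight_bytes, kv_bytes_per_seq, batch):
    """Memory traffic per generated token in one decode step.

    The weights are shared across the batch, so their cost is divided by
    the batch size. The KV-cache is per sequence, so it is not.
    """
    return weight_bytes / batch + kv_bytes_per_seq


# Hypothetical sizes: 14 GB of weights, 1 GB of KV-cache per sequence.
WEIGHT_BYTES = 14e9
KV_BYTES_PER_SEQ = 1e9

for batch in (1, 8, 64, 512):
    per_token = bytes_per_token(WEIGHT_BYTES, KV_BYTES_PER_SEQ, batch)
    print(f"batch={batch:4d}  bytes/token={per_token:.3e}")
```

In this model the per-token traffic falls as the batch grows but never drops below the per-sequence KV-cache read, so how much batching helps depends on the ratio of weight size to KV-cache size. With long contexts the KV term dominates and the two scale together, as described above.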