Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.
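A rough roofline calculation supports this. Sketch below assumes illustrative H100-class specs (the peak-FLOPS and bandwidth numbers are assumptions, not measurements): for a bf16 decode GEMM the arithmetic intensity works out to roughly the batch size, so batch 128 sits well under the ~300 FLOP/byte ridge point and stays memory-bound.

```python
# Back-of-the-envelope roofline check: is a decode-step GEMM with batch B
# compute-bound? Hardware numbers are assumed H100-class specs, for illustration.

PEAK_FLOPS = 989e12    # assumed dense BF16 peak, FLOP/s
MEM_BW = 3.35e12       # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2    # bf16 weights

def is_compute_bound(batch: int, d_model: int) -> bool:
    # One [batch, d] x [d, d] projection per decode step.
    flops = 2 * batch * d_model * d_model              # multiply-adds
    bytes_moved = BYTES_PER_PARAM * d_model * d_model  # weight reads dominate at small batch
    intensity = flops / bytes_moved                    # ~= batch for bf16 weights
    ridge = PEAK_FLOPS / MEM_BW                        # ~295 FLOP/byte on these specs
    return intensity > ridge

print(is_compute_bound(batch=128, d_model=4096))  # False: still memory-bound
print(is_compute_bound(batch=512, d_model=4096))  # True: past the ridge point
```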
Well, DeepSeek serving batch sizes are something like 8192, so 128 isn't much.
From the DeepSeek-V3 paper (https://arxiv.org/html/2412.19437v1): "the batch size per expert is relatively small (usually within 256 tokens)"