Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.
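A rough roofline calculation supports this. Sketch below assumes illustrative H100-class specs (the peak-FLOPS and bandwidth numbers are assumptions, not measurements): for a bf16 decode GEMM the arithmetic intensity works out to roughly the batch size, so batch 128 sits well under the ~300 FLOP/byte ridge point and stays memory-bound.

```python
# Back-of-the-envelope roofline check: is a decode-step GEMM with batch B
# compute-bound? Hardware numbers are assumed H100-class specs, for illustration.

PEAK_FLOPS = 989e12    # assumed dense BF16 peak, FLOP/s
MEM_BW = 3.35e12       # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2    # bf16 weights

def is_compute_bound(batch: int, d_model: int) -> bool:
    # One [batch, d] x [d, d] projection per decode step.
    flops = 2 * batch * d_model * d_model              # multiply-adds
    bytes_moved = BYTES_PER_PARAM * d_model * d_model  # weight reads dominate at small batch
    intensity = flops / bytes_moved                    # ~= batch for bf16 weights
    ridge = PEAK_FLOPS / MEM_BW                        # ~295 FLOP/byte on these specs
    return intensity > ridge

print(is_compute_bound(batch=128, d_model=4096))  # False: still memory-bound
print(is_compute_bound(batch=512, d_model=4096))  # True: past the ridge point
```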
Well, DeepSeek serving batch sizes are something like 8192, so 128 isn't much.
From the DeepSeek-V3 paper (https://arxiv.org/html/2412.19437v1): "the batch size per expert is relatively small (usually within 256 tokens)"