I'm guessing that's ~26 decode tokens/s for a 2- or 3-bit quantized MiniMax-M2.1 at zero context, and it only gets worse as the context grows.
I'm also fairly sure your prefill is slow enough to make the model mostly unusable even at smallish context windows, and entirely unusable at mid-to-large contexts.
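If you want to sanity-check those numbers, here's a rough back-of-envelope sketch. Every constant in it is an assumption, not a measurement: ~10B active MoE parameters per token, 3-bit weights, ~100 GB/s effective memory bandwidth, and ~60 tok/s prefill throughput. Decode on a local rig is usually memory-bandwidth bound, so tokens/s is roughly bandwidth divided by the bytes of active weights streamed per token:

```python
# Back-of-envelope estimate of local decode/prefill speed for a quantized
# MoE model. All constants are illustrative assumptions, not measurements.

ACTIVE_PARAMS = 10e9    # assumed active params per token (MoE)
BITS_PER_WEIGHT = 3     # 3-bit quantization
MEM_BANDWIDTH = 100e9   # assumed effective memory bandwidth, bytes/s
PREFILL_TPS = 60        # assumed prefill throughput, tokens/s

# Bytes of active weights read per generated token.
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

# Decode is memory-bandwidth bound: each token streams the active
# weights through memory once.
decode_tps = MEM_BANDWIDTH / bytes_per_token
print(f"decode: ~{decode_tps:.0f} tok/s at zero context")

# Prefill wait before the first generated token appears.
for prompt_tokens in (2_000, 16_000, 64_000):
    wait_s = prompt_tokens / PREFILL_TPS
    print(f"prefill {prompt_tokens:>6} tokens: ~{wait_s / 60:.1f} min wait")
```

With these assumed numbers you land right around 26-27 tok/s decode, and a 64k-token prompt means a wait on the order of fifteen-plus minutes before the first output token, which is what makes mid-to-large contexts impractical.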