Hacker News

hedgehog, yesterday at 9:31 PM

If I ran a proper benchmark I think the numbers would match yours. Minimax M2.7 is also surprisingly not that slow (around 140 t/s prefill and 23 t/s generation), and in some ways it's faster, since it seems to get things right with less thinking output.


Replies

anonym29, yesterday at 11:06 PM

The problem with M2.7 is that it uses full GQA, so attention cost still grows quadratically with context. It does start fast, but by 64k tokens of depth, the version I'm running (Unsloth's UD IQ2_XXS) sees pp512 drop 95%, from 261.3 t/s at zero context depth to 13.1 t/s. A q8_0 KV cache does help, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at zero. TG retention is better, but still approaches single digits by 64k depth even with the q8_0 KV cache.
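For anyone curious how those retention figures shake out, here's a quick back-of-the-envelope sketch using the numbers quoted above (the helper function is just illustrative, not part of any benchmark tool):

```python
# Sanity-check the quoted pp512 throughput drops at 64k context depth.
# All throughput figures are from the benchmark runs described above.

def drop_pct(at_zero_depth: float, at_64k_depth: float) -> float:
    """Percent of prefill throughput lost between 0 and 64k context."""
    return 100 * (1 - at_64k_depth / at_zero_depth)

# Default KV cache: 261.3 t/s -> 13.1 t/s
print(round(drop_pct(261.3, 13.1)))  # ~95% drop

# q8_0-quantized KV cache: 258.3 t/s -> 57.4 t/s
print(round(drop_pct(258.3, 57.4)))  # ~78% drop
```

So the q8_0 KV cache roughly quadruples deep-context prefill throughput here, though it's still a steep falloff, consistent with attention cost scaling quadratically in context length.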

That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.

I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.