
anonym29 · yesterday at 11:06 PM

The problem with M2.7 is that it's full GQA, meaning quadratic attention. It starts fast, but by 64k tokens of context depth, pp512 on the version I'm running (Unsloth's UD IQ2_XXS) drops 95%, from 261.3 t/s at 0 context depth to 13.1 t/s. q8_0 KV cache does help, still hitting 57.4 t/s at 64k depth vs 258.3 t/s at 0 depth. TG retention is better, but still approaches single digits by 64k depth even with the q8_0 KV cache.
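To make the shape of that falloff concrete, here's a back-of-envelope toy model (my sketch, not anything from llama.cpp): assume each prompt token pays a fixed cost plus attention work proportional to the current context depth, so throughput decays as base / (1 + k·depth). The only free parameter k is fitted from the two quoted measurements (261.3 t/s at depth 0, 13.1 t/s at 64k); everything in between is just what this toy model implies, not a measurement.

```python
def speed_at_depth(depth, base_tps, k):
    """Toy model: tokens/s when each prompt token also attends over
    `depth` cached tokens.

    base_tps: throughput at zero context depth (measured).
    k: relative attention cost per token of cached context
       (a free parameter of this sketch, fitted below, not measured).
    """
    return base_tps / (1.0 + k * depth)

# Fit k to the quoted numbers: 261.3 t/s at depth 0, 13.1 t/s at 64k.
base = 261.3
k = (base / 13.1 - 1.0) / 65536  # solves 13.1 = base / (1 + k * 65536)

# What the fitted toy model predicts at intermediate depths.
for depth in (0, 16384, 32768, 65536):
    print(depth, round(speed_at_depth(depth, base, k), 1))
```

By construction it reproduces the two endpoints; the interesting part is that even at 32k depth this curve already predicts roughly a 10x slowdown, which matches the "starts fast, dies deep" feel of full quadratic attention.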

That said, it was my favorite model when I valued output quality above all else, at least up until the new Qwen 3.6 27B, which I'm currently playing with.

I suspect I will like Qwen 3.6 122B A10B a LOT, maybe even better than M2.7.