MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
A few days ago I switched again from Qwen3.6 to Gemma 4 - for personal use I've experienced better average performance with the 26B version of the latter than the 27B of the former.
For someone who's been running local models for a long while, these are very very exciting times.
There's also growing interest in integrating DFlash: https://github.com/ggml-org/llama.cpp/issues/21978. I can't wait to see how it compares against MTP.
I'd love to see this in oMLX too. It has been a rather nice tool
I don't know exactly where MTP inference fits within the inference stack, but does anyone know whether it's possible to implement it for the MLX universe?
I have a dumb performance question.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
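To make the idea concrete, here's a toy sketch: have the model emit a short list of edit operations and apply them locally instead of regenerating the whole document. The JSON edit format and the apply_edits helper are made up purely for illustration, not how any particular tool actually works.

```python
# Toy sketch of the idea above: the model returns edit operations, we apply
# them to the existing text locally. The JSON shape and apply_edits are
# hypothetical, purely for illustration.
import json


def apply_edits(text: str, edits: list[dict]) -> str:
    """Apply simple search/replace edit operations to the existing text."""
    for edit in edits:
        # Replace one exact occurrence of `find` with `replace`.
        text = text.replace(edit["find"], edit["replace"], 1)
    return text


# Pretend this is what the model returned instead of a full rewrite.
model_output = json.dumps([
    {"find": "recieve", "replace": "receive"},
    {"find": "the the", "replace": "the"},
])

original = "We recieve the the data and store it."
print(apply_edits(original, json.loads(model_output)))
# -> "We receive the data and store it."
```

Note this only saves output tokens; the edit operations themselves are still generated one token at a time.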
Thanks for the link, it took qwen3.6-27B-q8 w/256k context on my RTX A6000 from ~20 t/s to 55 t/s. Prefill is mysteriously slower now, but prefill is still so much faster than decode that I think I'm still bottlenecked on output most of the time.
yet, still mostly useless.
Yeah, it's important to remember conceptually that MTP is kind of just more weights; speculative decoding is the runtime algorithm, and that's the significant addition to whatever code is serving the model.
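To illustrate what that runtime piece looks like, here's a rough sketch of a greedy draft-and-verify loop. This is not llama.cpp's implementation; the model objects and their generate/forward methods are stand-ins purely for illustration.

```python
# Rough sketch of greedy speculative decoding: draft cheap tokens, verify
# them with one forward pass of the big model, accept the agreeing prefix.
def speculative_decode(target_model, draft_model, tokens, n_draft=4, n_new=16):
    generated = 0
    while generated < n_new:
        # 1. Draft: cheaply guess the next n_draft tokens (MTP heads or a
        #    small draft model would play this role).
        draft = draft_model.generate(tokens, n_draft)

        # 2. Verify: one forward pass of the big model over the drafted
        #    positions; predictions[i] is its greedy guess for token i+1.
        predictions = target_model.forward(tokens + draft)

        # 3. Accept the longest prefix the big model agrees with; on the
        #    first mismatch, keep the big model's own token and stop.
        accepted = []
        for i, tok in enumerate(draft):
            if predictions[len(tokens) + i - 1] == tok:
                accepted.append(tok)
            else:
                accepted.append(predictions[len(tokens) + i - 1])
                break
        else:
            # Every draft token accepted: the verify pass yields a bonus token.
            accepted.append(predictions[len(tokens) + len(draft) - 1])

        tokens = tokens + accepted
        generated += len(accepted)
    return tokens


# Toy stand-in: both "models" just count upward, so every draft is accepted
# and each verify pass commits n_draft + 1 tokens at once.
class CountingModel:
    def generate(self, tokens, n):
        return [tokens[-1] + i + 1 for i in range(n)]

    def forward(self, tokens):
        return [t + 1 for t in tokens]


print(speculative_decode(CountingModel(), CountingModel(), [0, 1, 2]))
```

The speedup comes from committing several tokens per big-model forward pass while keeping the output identical (under greedy decoding) to what the big model alone would produce.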
There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673