MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
A few days ago I switched again from Qwen3.6 to Gemma 4 - for personal use I've experienced better average performance with the 26B version of the latter than the 27B of the former.
For someone who's been running local models for a long while, these are very very exciting times.
There's also growing interest in integrating DFlash: https://github.com/ggml-org/llama.cpp/issues/21978. I can't wait to see how it compares against MTP.
I'd love to see this in oMLX too. It has been a rather nice tool
I don't know exactly where MTP inference fits within the inference stack, but does anyone know whether it's possible to implement it for the MLX universe?
I have a dumb performance question.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
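To make the idea concrete, here's a toy sketch: have the model emit a short list of edit operations and apply them locally instead of regenerating the whole document. The JSON edit format and the apply_edits helper are made up purely for illustration, not how any particular tool actually works.

```python
# Toy sketch of the idea above: the model returns edit operations, we apply
# them to the existing text locally. The JSON shape and apply_edits are
# hypothetical, purely for illustration.
import json


def apply_edits(text: str, edits: list[dict]) -> str:
    """Apply simple search/replace edit operations to the existing text."""
    for edit in edits:
        # Replace one exact occurrence of `find` with `replace`.
        text = text.replace(edit["find"], edit["replace"], 1)
    return text


# Pretend this is what the model returned instead of a full rewrite.
model_output = json.dumps([
    {"find": "recieve", "replace": "receive"},
    {"find": "the the", "replace": "the"},
])

original = "We recieve the the data and store it."
print(apply_edits(original, json.loads(model_output)))
# -> "We receive the data and store it."
```

Note this only saves output tokens; the edit operations themselves are still generated one token at a time.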
Thanks for the link, it took qwen3.6-27B-q8 w/256k context on my RTX A6000 from ~20 t/s to 55 t/s. Prefill is mysteriously slower now, but prefill is still so much faster than decode that I think I'm still bottlenecked on output most of the time.
yet, still mostly useless.
Yeah, it's important to remember conceptually that MTP is kind of just more weights; speculative decoding is the runtime algorithm, and that's the significant addition to whatever code is serving the model.
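To illustrate what that runtime piece looks like, here's a rough sketch of a greedy draft-and-verify loop. This is not llama.cpp's implementation; the model objects and their generate/forward methods are stand-ins purely for illustration.

```python
# Rough sketch of greedy speculative decoding: draft cheap tokens, verify
# them with one forward pass of the big model, accept the agreeing prefix.
def speculative_decode(target_model, draft_model, tokens, n_draft=4, n_new=16):
    generated = 0
    while generated < n_new:
        # 1. Draft: cheaply guess the next n_draft tokens (MTP heads or a
        #    small draft model would play this role).
        draft = draft_model.generate(tokens, n_draft)

        # 2. Verify: one forward pass of the big model over the drafted
        #    positions; predictions[i] is its greedy guess for token i+1.
        predictions = target_model.forward(tokens + draft)

        # 3. Accept the longest prefix the big model agrees with; on the
        #    first mismatch, keep the big model's own token and stop.
        accepted = []
        for i, tok in enumerate(draft):
            if predictions[len(tokens) + i - 1] == tok:
                accepted.append(tok)
            else:
                accepted.append(predictions[len(tokens) + i - 1])
                break
        else:
            # Every draft token accepted: the verify pass yields a bonus token.
            accepted.append(predictions[len(tokens) + len(draft) - 1])

        tokens = tokens + accepted
        generated += len(accepted)
    return tokens


# Toy stand-in: both "models" just count upward, so every draft is accepted
# and each verify pass commits n_draft + 1 tokens at once.
class CountingModel:
    def generate(self, tokens, n):
        return [tokens[-1] + i + 1 for i in range(n)]

    def forward(self, tokens):
        return [t + 1 for t in tokens]


print(speculative_decode(CountingModel(), CountingModel(), [0, 1, 2]))
```

The speedup comes from committing several tokens per big-model forward pass while keeping the output identical (under greedy decoding) to what the big model alone would produce.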
There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673