Hacker News

zdw · yesterday at 5:00 PM

MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.

The uplift on local/self-hosted models in both quality and speed has been amazing over the last few months.


Replies

tarruda · yesterday at 5:31 PM

There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673

nzeid · yesterday at 7:34 PM

A few days ago I switched again from Qwen3.6 to Gemma 4; for personal use, the 26B version of the latter has given me better average performance than the 27B of the former.

For someone who's been running local models for a long while, these are very very exciting times.

egeres · yesterday at 10:59 PM

There's also growing interest in integrating DFlash: https://github.com/ggml-org/llama.cpp/issues/21978. I can't wait to see how it will compare against MTP.

fridder · yesterday at 8:20 PM

I'd love to see this in oMLX too. It has been a rather nice tool.

endymi0n · yesterday at 11:26 PM

I don’t know exactly where MTP fits within the inference stack, but does anyone know whether it’s possible to implement it for the MLX universe?

basch · yesterday at 6:24 PM

I have a dumb performance question.

When asking a model to change text in a minor way, why are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
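For illustration, a minimal sketch of what applying model-emitted edit operations could look like. The operation names ("retain", "insert", "delete") and the tuple format are invented for this example, not any particular tool's protocol:

```python
def apply_ops(text, ops):
    """Apply a sequence of OT-style operations to text."""
    out = []
    pos = 0
    for op, arg in ops:
        if op == "retain":        # keep the next `arg` characters unchanged
            out.append(text[pos:pos + arg])
            pos += arg
        elif op == "delete":      # skip the next `arg` characters
            pos += arg
        elif op == "insert":      # splice in new text at the cursor
            out.append(arg)
    out.append(text[pos:])        # keep any trailing text
    return "".join(out)

# Fix a three-character typo without re-emitting the whole string:
ops = [("retain", 8), ("delete", 3), ("insert", "the")]
print(apply_ops("Fix in: teh middle of text", ops))
# prints: Fix in: the middle of text
```

The point of the question is that the model would only need to generate the short `ops` list, while the unchanged text is copied mechanically at essentially zero cost.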

nullc · yesterday at 11:58 PM

Thanks for the link; it took qwen3.6-27B-q8 with 256k context on my RTX A6000 from ~20 t/s to 55 t/s. Prefill is mysteriously slower, but prefill is still so much faster than decoding that I think I'm bottlenecked on output most of the time anyway.

EGreg · yesterday at 5:15 PM

How does this get added in practice?

dakolli · yesterday at 5:04 PM

Yet, still mostly useless.

WhitneyLand · yesterday at 5:19 PM

Yeah, it's important to remember that conceptually MTP is kind of just more weights; speculative decoding is the runtime algorithm, and that's a significant addition to whatever code is serving the model.
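To illustrate the "runtime algorithm" part: a toy sketch of the speculative-decoding accept/verify loop. Everything here (function name, per-token scalar probabilities instead of full-vocabulary distributions) is simplified for illustration; a real serving stack scores the whole draft in one batched forward pass of the target model:

```python
import random

def speculative_step(draft_tokens, p_draft, p_target, rng=random.random):
    """Return the accepted prefix of a speculated draft.

    A cheap draft source (e.g. an MTP head) proposes k tokens; the full
    model assigns each one a probability. A draft token is accepted with
    probability min(1, p_target / p_draft); the first rejection truncates
    the run, and the target model would then sample the next token itself.
    """
    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        if rng() < min(1.0, pt / pd):   # standard acceptance test
            accepted.append(tok)
        else:
            break                        # reject: stop using the draft here
    return accepted

# When the target model agrees strongly with the draft, the whole run is
# kept, so k tokens are emitted for roughly one target forward pass:
print(speculative_step(["the", "quick", "fox"],
                       [0.5, 0.5, 0.5],
                       [0.9, 0.9, 0.9]))
# prints: ['the', 'quick', 'fox']
```

The acceptance rule is what makes the output distribution match the target model exactly, which is why the extra MTP weights alone aren't enough: the server needs this verify loop wired into its sampling code.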
