tarruda • yesterday at 5:31 PM • 3 replies

There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673

Replies

entropicdrifter • yesterday at 6:21 PM

Ollama merged a PR for MTP about 2 hours ago, as well:

https://github.com/ollama/ollama/pull/15980

Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0

xlayn • yesterday at 9:47 PM

Ohhhh geee!!! I just applied the patch to my local git copy. You need to use the model from the PR that he submitted; that model is special because it carries the extra information that makes MTP possible. I have two AMD GPUs, and qwen3.6 27B qk6 does around 20 t/s generation... If I run it on only one I get like 35 t/s.

But with this patch I saw 46 t/s with qwen3.6 27B q8... this is insane, that's about 2.3x the 20 t/s I started from (roughly 130% faster), and there was no GPU I could upgrade to for that kind of boost. Amazing!
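
For anyone who wants to check the arithmetic on those numbers, here is a rough sketch. The `speedup` helper is a hypothetical name, not part of llama.cpp or Ollama, and the inputs are only the throughput figures quoted in this comment. Keep in mind the baselines were measured with a different quantization (qk6) than the patched run (q8), so the comparison is approximate at best.

```python
def speedup(baseline_tps: float, new_tps: float) -> tuple[float, float]:
    """Return (ratio, percent_faster) for two throughput figures in tokens/s."""
    ratio = new_tps / baseline_tps
    return ratio, (ratio - 1.0) * 100.0


# Figures quoted in the comment: 20 t/s on two GPUs, 35 t/s on one, 46 t/s patched.
for label, base in [("two GPUs", 20.0), ("one GPU", 35.0)]:
    ratio, pct = speedup(base, 46.0)
    print(f"vs {label}: {ratio:.2f}x, {pct:.0f}% faster")
# vs two GPUs: 2.30x, 130% faster
# vs one GPU: 1.31x, 31% faster
```

Which baseline is the fair one depends on how many GPUs the patched run used, which the comment does not say.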
