Hacker News

Accelerating Gemma 4: faster inference with multi-token prediction drafters

443 points | by amrrs | yesterday at 4:14 PM | 198 comments

Comments

WarmWash yesterday at 6:27 PM

I don't see it talked about much, but Gemma (and Gemini) use enormously fewer tokens per output than other models, while still staying within arm's reach of top benchmark performance.

It's not uncommon to see a Gemma vs. Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, Gemma is now underperforming the leading open models by 5-10%, but doing it in 1/10th the time.

show 2 replies
zdw yesterday at 5:00 PM

MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.

The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

show 5 replies
msp26 yesterday at 6:06 PM

Google is single-handedly carrying Western open-source models. Gemma 4 31B is fantastic.

However, it is a little painful trying to fit the best possible version into 24GB of VRAM with vision support plus this drafter soon. My build doesn't support any more GPUs, and I believe I would either want another 4090 (overpriced) for best performance or have to replace it altogether.

show 1 reply
skybrian yesterday at 5:07 PM

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

show 4 replies
aleksiy123 yesterday at 6:54 PM

I'm starting to think that Google's strategy is a bit different from the other frontier providers'.

They focus more on performance-to-compute efficiency than on pure performance. Maybe that's why Gemini is (seemingly) lagging behind?

Other providers are hitting capacity and hitting the limits of subsidising their inference.

Google's strategy seems to be about scaling and distributing these models to its existing billions of users.

show 1 reply
christina97 yesterday at 5:31 PM

I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment.

I tried Qwen first, but it was unstable and had ridiculously long thinking traces!

show 1 reply
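
For anyone wanting to reproduce a setup like the one above, here is a minimal sketch using vLLM's offline Python API. The model id follows the naming used elsewhere in this thread and the quantization method is an assumption, so substitute whatever 4-bit build you actually have:

```python
# Minimal sketch of an offline vLLM setup like the one described above.
# The model id follows the thread's naming and the quantization method is an
# assumption -- adjust to your actual 4-bit build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26B-A4B-it",  # assumed repo id, per the thread
    quantization="awq",                  # assuming an AWQ 4-bit quant exists
    max_model_len=8192,                  # keep the KV cache within 24 GB of VRAM
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize multi-token prediction in two sentences."], params)
print(out[0].outputs[0].text)
```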
these yesterday at 4:49 PM

Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.

show 4 replies
Patrick_Devine yesterday at 6:17 PM

In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance.

You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.

julianlam yesterday at 6:01 PM

Really excited to try this once it is merged into llama.cpp.

Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.

I have tried draft models with limited success (even the smaller 3B draft model alongside a dense 14B Ministral model introduced too much overhead).

show 1 reply
vhiremath4 yesterday at 6:18 PM

So this is like branch prediction in CPUs? Except the probabilities are baked into the model itself, so it's even more reliable.
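
The analogy mostly holds: a cheap drafter runs ahead, and the big model verifies its guesses in a single parallel pass, rolling back at the first mismatch, much like speculative execution after a predicted branch. Below is a toy greedy sketch; real implementations use a probabilistic accept/reject rule rather than exact argmax matching, and `draft_next` / `target_greedy_tokens` are hypothetical helpers standing in for the small and big models.

```python
# Toy sketch of greedy speculative decoding. `draft_next` and
# `target_greedy_tokens` are hypothetical helpers for the drafter and target.
def speculative_step(prefix, k, draft_next, target_greedy_tokens):
    # 1. The drafter proposes k tokens autoregressively (cheap but sequential).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks all k positions in ONE forward pass:
    #    verified[i] is the target's greedy choice given prefix + proposed[:i].
    #    This is the parallel "verification" step.
    verified = target_greedy_tokens(prefix, proposed)

    # 3. Keep the longest agreeing run; at the first disagreement emit the
    #    target's own token instead, so output matches plain greedy decoding.
    accepted = []
    for guess, truth in zip(proposed, verified):
        if guess != truth:
            accepted.append(truth)
            break
        accepted.append(guess)
    return prefix + accepted
```

When the drafter agrees often, each expensive forward pass of the big model yields several tokens instead of one, which is where the speedup comes from.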

brikym yesterday at 9:23 PM

I wonder what latency and tok/s this model would be capable of on Groq or Cerebras. I have a couple of LLM-driven games [1][2] where speed is really important to the experience. Currently the best performance I can get is from the gpt-oss models on Groq or Cerebras, but they need quite a bit of extra context and tooling to correct for mistakes. I'm betting I'll be able to get the same performance much cheaper in the next few months.

[1] https://sleuththetruth.com [2] https://lextension.net/

nolist_policy yesterday at 9:11 PM

Works great in the latest version of Google AI Edge Gallery: https://github.com/google-ai-edge/gallery/releases

wrxd yesterday at 9:36 PM

I'm not sure I understand how this works: https://huggingface.co/google/gemma-4-E4B-it-assistant has 78.8M parameters, while the standard variant https://huggingface.co/google/gemma-4-E4B-it has 8B parameters.

Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?

show 1 reply
mchusma yesterday at 4:41 PM

I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?

show 6 replies
netdur yesterday at 7:32 PM

I am getting 21 t/s on a Fold 7; 21 x 1.8 = 37.8 t/s compared to the M1 Max's 54 t/s. That is impressive.

recsv-heredoc yesterday at 5:37 PM

Cloudflare offers an excellent service for many of the open-weights models. It's fast, cheap, and simple to set up. I can highly recommend it as an LLM provider.

They serve gemma-4-26b-a4b-it.

show 1 reply
regexorcist yesterday at 7:03 PM

Sounds like a game changer if I see that kind of speedup on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.

el_isma yesterday at 7:38 PM

How is this different from the speculative decoding that we had before?

You could pair a big and a small model, like Qwen 32B with Qwen 4B, and get that same dynamic of the small model generating tokens and the big one "certifying" them.

The blog says something about re-using the big model's data?

show 3 replies
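
A rough answer (hedged, since the thread doesn't spell out the Gemma 4 design): with classic speculative decoding the small model re-reads the whole context on its own, whereas an MTP-style drafter is a tiny head fed the big model's last hidden state, in the spirit of DeepSeek-V3's MTP modules, so it "re-uses the big model's data" and tends to agree with it more often. A conceptual PyTorch sketch, with all names hypothetical:

```python
# Conceptual sketch of an MTP-style drafter head (in the spirit of
# DeepSeek-V3's multi-token prediction modules), NOT the actual Gemma 4
# architecture. Unlike a separate small draft model, it consumes the target
# model's final hidden state instead of re-encoding the context itself.
import torch
import torch.nn as nn

class MTPDrafterHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)  # mix state + embedding
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, target_hidden: torch.Tensor, last_token_emb: torch.Tensor):
        # target_hidden:  [batch, hidden] final hidden state from the big model
        # last_token_emb: [batch, hidden] embedding of the token just emitted
        # returns logits for the *following* token, i.e. the drafter's guess.
        fused = self.fuse(torch.cat([target_hidden, last_token_emb], dim=-1))
        h = self.block(fused.unsqueeze(1)).squeeze(1)  # one tiny transformer layer
        return self.lm_head(h)
```

That would also explain the tiny parameter counts of the `-assistant` checkpoints listed elsewhere in the thread: the drafter is a small head riding on the main model rather than a standalone LLM.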
AbuAssar yesterday at 6:09 PM

these are the updated models:

google/gemma-4-31B-it-assistant

google/gemma-4-26B-A4B-it-assistant

google/gemma-4-E4B-it-assistant

google/gemma-4-E2B-it-assistant

show 1 reply
joakleaf yesterday at 8:39 PM

Seems like a pull request for vLLM was just approved a few minutes ago:

https://github.com/vllm-project/vllm/pull/41745

("Add Gemma4 MTP speculative decoding support")

disiplus yesterday at 4:54 PM

Nice, will run it later against qwen3.6 27b. The speed was one of the reasons why I was running Qwen and not Gemma. The difference was big; there is some magic that happens when you have more than 100 tps.

julianlam yesterday at 7:22 PM

Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?

show 2 replies
pu_pe yesterday at 5:26 PM

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?

show 3 replies
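
Rough back-of-envelope for the weight overhead, using the ~78.8M-parameter drafter size quoted elsewhere in the thread (this ignores the drafter's activations and any extra KV cache from verification):

```python
# Back-of-envelope weight memory for a ~78.8M-parameter drafter, the figure
# quoted elsewhere in the thread. Activations and KV cache are not counted.
params = 78.8e6
for dtype, bytes_per_param in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    print(f"{dtype}: {params * bytes_per_param / 2**20:.0f} MiB")
# -> bf16: ~150 MiB, fp8: ~75 MiB, int4: ~38 MiB
```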
sigmar yesterday at 6:37 PM

>try them directly on Google AI Edge Gallery for Android or iOS.

I'm not seeing any update to the app on my Android phone... maybe later today?

>We’ve published an in-depth technical explainer

I was expecting a PDF link, but this goes to a brief article on Twitter/X. lol, okay...

tannhaeuser yesterday at 7:34 PM

Tested the gemma4 26B MoE 4-bit quantised GGUF on llama.cpp, following these guides, with mmap'd I/O on a 16GB MBP, and it was unbearably slow (0.0 t/s).
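
A likely explanation, assuming a typical ~4.5 bits/weight Q4-class GGUF: the weights alone nearly fill 16 GB, so llama.cpp ends up paging them from disk through mmap on every token, which collapses throughput. A quick sizing check:

```python
# Rough sizing for why a 16 GB machine struggles with a 26B-parameter model:
# at ~4.5 bits/weight (typical Q4 GGUF, an assumed figure) the weights alone
# approach the machine's total RAM, before KV cache and OS overhead.
params = 26e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # -> ~14.6 GB
```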

ThouYS yesterday at 9:48 PM

Don't know about this guy, but qwen3.6:27b with the UD 4-bit quant and little-coder/pi has been amazing. The first local LLM experience that can do actual meaningful work.

show 1 reply
larnon yesterday at 8:56 PM

Has anyone tried this with vLLM yet? I'm confused about how to turn it on, tbh.
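
If the vLLM PR linked elsewhere in the thread follows vLLM's existing `speculative_config` interface, enabling it would presumably look something like the sketch below. The method name, drafter id, and token count are assumptions, so check the merged PR and release notes for the real values:

```python
# Sketch of enabling MTP speculative decoding in vLLM, ASSUMING the Gemma 4
# support follows the existing `speculative_config` interface. The method
# name, model ids, and num_speculative_tokens are guesses.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26B-A4B-it",  # base model id, per the thread's naming
    speculative_config={
        "method": "gemma4_mtp",                          # hypothetical method name
        "model": "google/gemma-4-26B-A4B-it-assistant",  # drafter id from the thread
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```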

deskamess yesterday at 5:34 PM

Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.

shay_ker yesterday at 5:18 PM

Curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron does.

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

show 1 reply
noashavit yesterday at 7:45 PM

Gemma4:e4b is a huge upgrade

franze yesterday at 6:51 PM

If someone wants to work with Gemma and not deal with Ollama or configs, there is (my baby) https://airplane-ai.franzai.com/

Beta, but usable.

show 2 replies
brcmthrowaway yesterday at 5:35 PM

Is Google's local-model strategy aimed at taking the big AI cloud labs down a notch?

simianwords yesterday at 7:03 PM

Gemma 4 is really a beast. The 31B version is totally usable, e.g. for when I'm bored without internet.

ActorNightly yesterday at 6:13 PM

I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma 3 27B QAT was my go-to for some time, as it was quite fast. Qwen is still king for a balance of accuracy and inference speed.

Gemma:31b was more accurate, but the speed was horrendous.

m3kw9 yesterday at 5:35 PM

ok so? Anyone got a verdict/review?

rahimnathwani yesterday at 6:27 PM

[dead]

Gormers yesterday at 9:14 PM

[flagged]