Hacker News

Accelerating Gemma 4: faster inference with multi-token prediction drafters

443 points | by amrrs | yesterday at 4:14 PM | 198 comments

Comments

WarmWash yesterday at 6:27 PM

I don't see it talked about much, but Gemma (and Gemini) use enormously fewer tokens per output than other models, while still staying within arm's reach of top benchmark performance.

It's not uncommon to see a Gemma vs. Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, Gemma is now underperforming the leading open models by 5-10%, but doing it in 1/10th the time.

show 2 replies
zdw yesterday at 5:00 PM

MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.

The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

show 5 replies
msp26 yesterday at 6:06 PM

Google is single-handedly carrying Western open-source models. Gemma 4 31B is fantastic.

However, it is a little painful trying to fit the best possible version into 24GB of VRAM with vision support plus this drafter soon. My build doesn't support any more GPUs, and I believe I would either want another 4090 (overpriced) for best performance or have to replace it altogether.

show 1 reply
skybrian yesterday at 5:07 PM

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

show 4 replies
aleksiy123 yesterday at 6:54 PM

I'm starting to think that Google's strategy is a bit different from the other frontier providers'.

They focus more on performance-to-compute efficiency than on pure performance. Maybe that's why Gemini is (seemingly) lagging behind?

Other providers are hitting capacity and hitting the limits of subsidising their inference.

Google's strategy seems to be about scaling and distributing these models to its existing billions of users.

show 1 reply
christina97 yesterday at 5:31 PM

I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment.

I tried Qwen first, but it was unstable and had ridiculously long thinking traces!

show 1 reply
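
For anyone wanting to reproduce a setup like the one above, here is a minimal sketch using vLLM's offline Python API. The model id follows the naming used elsewhere in this thread and the quantization method is an assumption, so substitute whatever 4-bit build you actually have:

```python
# Minimal sketch of an offline vLLM setup like the one described above.
# The model id follows the thread's naming and the quantization method is an
# assumption -- adjust to your actual 4-bit build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26B-A4B-it",  # assumed repo id, per the thread
    quantization="awq",                  # assuming an AWQ 4-bit quant exists
    max_model_len=8192,                  # keep the KV cache within 24 GB of VRAM
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize multi-token prediction in two sentences."], params)
print(out[0].outputs[0].text)
```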
these yesterday at 4:49 PM

Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.

show 4 replies
Patrick_Devine yesterday at 6:17 PM

In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance.

You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.

julianlam yesterday at 6:01 PM

Really excited to try this once it is merged into llama.cpp.

Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.

I have tried draft models with limited success (even the smaller 3B draft model alongside a dense 14B Ministral model introduced too much overhead).

show 1 reply
vhiremath4 yesterday at 6:18 PM

So this is like branch prediction in CPUs? Except the probabilities are baked into the model itself, so it's even more reliable.
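
The analogy mostly holds: a cheap drafter runs ahead, and the big model verifies its guesses in a single parallel pass, rolling back at the first mismatch, much like speculative execution after a predicted branch. Below is a toy greedy sketch; real implementations use a probabilistic accept/reject rule rather than exact argmax matching, and `draft_next` / `target_greedy_tokens` are hypothetical helpers standing in for the small and big models.

```python
# Toy sketch of greedy speculative decoding. `draft_next` and
# `target_greedy_tokens` are hypothetical helpers for the drafter and target.
def speculative_step(prefix, k, draft_next, target_greedy_tokens):
    # 1. The drafter proposes k tokens autoregressively (cheap but sequential).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks all k positions in ONE forward pass:
    #    verified[i] is the target's greedy choice given prefix + proposed[:i].
    #    This is the parallel "verification" step.
    verified = target_greedy_tokens(prefix, proposed)

    # 3. Keep the longest agreeing run; at the first disagreement emit the
    #    target's own token instead, so output matches plain greedy decoding.
    accepted = []
    for guess, truth in zip(proposed, verified):
        if guess != truth:
            accepted.append(truth)
            break
        accepted.append(guess)
    return prefix + accepted
```

When the drafter agrees often, each expensive forward pass of the big model yields several tokens instead of one, which is where the speedup comes from.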

brikym yesterday at 9:23 PM

I wonder what latency and tok/s this model would be capable of on Groq or Cerebras. I have a couple of LLM-driven games [1][2] where speed is really important to the experience. Currently the best performance I can get is from the gpt-oss models on Groq or Cerebras, but they need quite a bit of extra context and tooling to correct for mistakes. I'm betting I'll be able to get the same performance much cheaper in the next few months.

[1] https://sleuththetruth.com [2] https://lextension.net/

nolist_policy yesterday at 9:11 PM

Works great in the latest version of Google AI Edge Gallery: https://github.com/google-ai-edge/gallery/releases

wrxd yesterday at 9:36 PM

I'm not sure I understand how this works: https://huggingface.co/google/gemma-4-E4B-it-assistant has 78.8M parameters, while the standard variant https://huggingface.co/google/gemma-4-E4B-it has 8B parameters.

Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?

show 1 reply
mchusma yesterday at 4:41 PM

I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?

show 6 replies
netdur yesterday at 7:32 PM

I am getting 21 t/s on a Fold 7; 21 x 1.8 = 37.8 t/s compared to the M1 Max's 54 t/s. That is impressive.

recsv-heredoc yesterday at 5:37 PM

Cloudflare offers an excellent service for many of the open-weights models. It's fast, cheap, and simple to set up. I can highly recommend it as an LLM provider.

They serve gemma-4-26b-a4b-it.

show 1 reply
regexorcist yesterday at 7:03 PM

Sounds like a game changer if I see that kind of speedup on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.

el_isma yesterday at 7:38 PM

How is this different from the speculative decoding that we had before?

You could pair a big and a small model, like Qwen 32B with Qwen 4B, and get that same dynamic of the small model generating tokens and the big one "certifying" them.

The blog says something about re-using the big model's data?

show 3 replies
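
A rough answer (hedged, since the thread doesn't spell out the Gemma 4 design): with classic speculative decoding the small model re-reads the whole context on its own, whereas an MTP-style drafter is a tiny head fed the big model's last hidden state, in the spirit of DeepSeek-V3's MTP modules, so it "re-uses the big model's data" and tends to agree with it more often. A conceptual PyTorch sketch, with all names hypothetical:

```python
# Conceptual sketch of an MTP-style drafter head (in the spirit of
# DeepSeek-V3's multi-token prediction modules), NOT the actual Gemma 4
# architecture. Unlike a separate small draft model, it consumes the target
# model's final hidden state instead of re-encoding the context itself.
import torch
import torch.nn as nn

class MTPDrafterHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)  # mix state + embedding
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, target_hidden: torch.Tensor, last_token_emb: torch.Tensor):
        # target_hidden:  [batch, hidden] final hidden state from the big model
        # last_token_emb: [batch, hidden] embedding of the token just emitted
        # returns logits for the *following* token, i.e. the drafter's guess.
        fused = self.fuse(torch.cat([target_hidden, last_token_emb], dim=-1))
        h = self.block(fused.unsqueeze(1)).squeeze(1)  # one tiny transformer layer
        return self.lm_head(h)
```

That would also explain the tiny parameter counts of the `-assistant` checkpoints listed elsewhere in the thread: the drafter is a small head riding on the main model rather than a standalone LLM.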
AbuAssar yesterday at 6:09 PM

these are the updated models:

google/gemma-4-31B-it-assistant

google/gemma-4-26B-A4B-it-assistant

google/gemma-4-E4B-it-assistant

google/gemma-4-E2B-it-assistant

show 1 reply
joakleaf yesterday at 8:39 PM

Seems like a pull request for vLLM was just approved a few minutes ago:

https://github.com/vllm-project/vllm/pull/41745

("Add Gemma4 MTP speculative decoding support")

disiplus yesterday at 4:54 PM

Nice, will run it later against qwen3.6 27b. The speed was one of the reasons why I was running Qwen and not Gemma. The difference was big; there is some magic that happens when you have more than 100 tps.

julianlam yesterday at 7:22 PM

Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?

show 2 replies
pu_pe yesterday at 5:26 PM

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?

show 3 replies
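
Rough back-of-envelope for the weight overhead, using the ~78.8M-parameter drafter size quoted elsewhere in the thread (this ignores the drafter's activations and any extra KV cache from verification):

```python
# Back-of-envelope weight memory for a ~78.8M-parameter drafter, the figure
# quoted elsewhere in the thread. Activations and KV cache are not counted.
params = 78.8e6
for dtype, bytes_per_param in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    print(f"{dtype}: {params * bytes_per_param / 2**20:.0f} MiB")
# -> bf16: ~150 MiB, fp8: ~75 MiB, int4: ~38 MiB
```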
sigmar yesterday at 6:37 PM

>try them directly on Google AI Edge Gallery for Android or iOS.

I'm not seeing any update to the app on my Android phone... maybe later today?

>We’ve published an in-depth technical explainer

I was expecting a PDF link, but this goes to a brief article on Twitter/X. lol, okay...

tannhaeuser yesterday at 7:34 PM

Tested the gemma4 26B MoE 4-bit quantised GGUF on llama.cpp, following these guides, with mmap'd I/O on a 16GB MBP, and it was unbearably slow (0.0 t/s).
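
A likely explanation, assuming a typical ~4.5 bits/weight Q4-class GGUF: the weights alone nearly fill 16 GB, so llama.cpp ends up paging them from disk through mmap on every token, which collapses throughput. A quick sizing check:

```python
# Rough sizing for why a 16 GB machine struggles with a 26B-parameter model:
# at ~4.5 bits/weight (typical Q4 GGUF, an assumed figure) the weights alone
# approach the machine's total RAM, before KV cache and OS overhead.
params = 26e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # -> ~14.6 GB
```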

ThouYS yesterday at 9:48 PM

Don't know about this guy, but qwen3.6:27b with the UD 4-bit quant and little-coder/pi has been amazing. The first local LLM experience that can do actual meaningful work.

show 1 reply
larnon yesterday at 8:56 PM

Has anyone tried this with vLLM yet? I'm confused about how to turn it on, tbh.
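
If the vLLM PR linked elsewhere in the thread follows vLLM's existing `speculative_config` interface, enabling it would presumably look something like the sketch below. The method name, drafter id, and token count are assumptions, so check the merged PR and release notes for the real values:

```python
# Sketch of enabling MTP speculative decoding in vLLM, ASSUMING the Gemma 4
# support follows the existing `speculative_config` interface. The method
# name, model ids, and num_speculative_tokens are guesses.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26B-A4B-it",  # base model id, per the thread's naming
    speculative_config={
        "method": "gemma4_mtp",                          # hypothetical method name
        "model": "google/gemma-4-26B-A4B-it-assistant",  # drafter id from the thread
        "num_speculative_tokens": 3,
    },
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```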

deskamess yesterday at 5:34 PM

Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.

shay_ker yesterday at 5:18 PM

Curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron does.

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

show 1 reply
noashavit yesterday at 7:45 PM

Gemma4:e4b is a huge upgrade

franze yesterday at 6:51 PM

If someone wants to work with Gemma and not deal with Ollama or configs, there is (my baby) https://airplane-ai.franzai.com/

Beta, but usable.

show 2 replies
brcmthrowaway yesterday at 5:35 PM

Is Google's local-model strategy aimed at taking the big AI cloud labs down a notch?

simianwords yesterday at 7:03 PM

Gemma 4 is really a beast. The 31B version is totally usable, e.g. for when I'm bored without internet.

ActorNightly yesterday at 6:13 PM

I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma 3 27B QAT was my go-to for some time, as it was quite fast. Qwen is still king for a balance of accuracy and inference speed.

Gemma:31b was more accurate, but the speed was horrendous.

m3kw9 yesterday at 5:35 PM

ok so? Anyone got a verdict/review?

rahimnathwani yesterday at 6:27 PM

[dead]

Gormers yesterday at 9:14 PM

[flagged]