Hacker News

pu_pe yesterday at 5:26 PM

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?


Replies

tarruda yesterday at 5:39 PM

They also published draft models for E4B and E2B. For those, the draft models are only 78m parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant

coder543 yesterday at 5:43 PM

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

furyofantares yesterday at 7:05 PM

Is it really no quality degradation?

I'm curious where my understanding is wrong, but I didn't think you necessarily got exactly the same output, given how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model would produce, they're accepted.

I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot), just one the larger model could have produced with whatever top-k and temperature settings are in play.
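For reference, the acceptance rule in the original speculative sampling papers (Leviathan et al. / Chen et al.) is a rejection-sampling step, roughly like this sketch (the names and shapes here are mine, not any particular library's API):

```python
import numpy as np

def accept_or_resample(p_target, q_draft, x, rng):
    """One acceptance step of speculative sampling (rough sketch).

    p_target, q_draft: next-token probability vectors of the large and
    draft models (after whatever temperature/top-k you run with).
    x: the token id the draft model proposed.
    """
    # Accept the drafted token with probability min(1, p/q); a token the
    # large model likes at least as much as the draft did is always kept.
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the leftover mass max(0, p - q),
    # renormalized. This is what makes the overall output distribution
    # exactly match the large model's, so there's no quality loss.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```

So a drafted token doesn't have to be the large model's top pick to survive, just one the large model would plausibly have sampled itself, which matches the "good enough" intuition above while still preserving the target distribution.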

ac29 yesterday at 11:57 PM

Memory and compute/energy overhead

moffkalast yesterday at 9:05 PM

It's based on taking advantage of spare compute if you have it. A tiny model generates a few tokens ahead first, then the large one runs batched inference on all of those positions at once, as if it were already at that point in the sequence. If the drafted tokens all check out afterwards it jumps ahead; otherwise it throws away everything from the first mismatch onward and carries on from there (rough sketch below).

Not sure about this implementation, but conceptually it only works well on very capable GPUs and for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which is ridiculous.

And if you don't have enough compute, then you get negative speedup from all the extra overhead.
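Rough sketch of the loop I mean (greedy verification for simplicity; the model callables and names are placeholders, not any real inference API):

```python
import numpy as np

def speculative_step(tokens, draft_next_token, target_logits, k=4):
    """One draft-then-verify round, greedy variant (rough sketch).

    draft_next_token(seq) -> next token id from the tiny draft model
    target_logits(seq)    -> per-position logits from the large model,
                             computed for the whole sequence in one
                             batched forward pass
    """
    # 1. Tiny model runs k steps ahead on its own (cheap, but sequential).
    seq = list(tokens)
    for _ in range(k):
        seq.append(draft_next_token(seq))

    # 2. Large model scores every drafted position at once, as if it
    #    were already that far along in the sequence.
    logits = target_logits(seq)

    # 3. Keep drafted tokens while they match what the large model would
    #    have picked anyway; at the first mismatch, take the large
    #    model's token instead and stop.
    out = list(tokens)
    for i in range(k):
        target_tok = int(np.argmax(logits[len(tokens) + i - 1]))
        out.append(target_tok)
        if seq[len(tokens) + i] != target_tok:
            break
    return out
```

Best case you commit k tokens for one big-model pass; worst case you've burned the draft work and the batched verification for a single token, which is where the negative speedup comes from.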