Hacker News

pu_pe yesterday at 5:26 PM

So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?


Replies

tarruda yesterday at 5:39 PM

They also published draft models for E4B and E2B. For those, the draft models are only 78m parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant

coder543 yesterday at 5:43 PM

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.

furyofantares yesterday at 7:05 PM

Is it really no quality degradation?

I'm curious where my understanding is wrong, but I didn't think you necessarily got exactly the same output, given how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model would produce, they're accepted.

I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot), just one the larger model could have produced with whatever top-k and temperature settings are in play.
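For reference, the acceptance rule in the original speculative sampling papers (Leviathan et al. / Chen et al.) is a rejection-sampling step, roughly like this sketch (the names and shapes here are mine, not any particular library's API):

```python
import numpy as np

def accept_or_resample(p_target, q_draft, x, rng):
    """One acceptance step of speculative sampling (rough sketch).

    p_target, q_draft: next-token probability vectors of the large and
    draft models (after whatever temperature/top-k you run with).
    x: the token id the draft model proposed.
    """
    # Accept the drafted token with probability min(1, p/q); a token the
    # large model likes at least as much as the draft did is always kept.
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the leftover mass max(0, p - q),
    # renormalized. This is what makes the overall output distribution
    # exactly match the large model's, so there's no quality loss.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False
```

So a drafted token doesn't have to be the large model's top pick to survive, just one the large model would plausibly have sampled itself, which matches the "good enough" intuition above while still preserving the target distribution.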

ac29 yesterday at 11:57 PM

Memory and compute/energy overhead

moffkalast yesterday at 9:05 PM

It's based on taking advantage of spare compute if you have it. A tiny model generates a few tokens ahead first, then the large one runs batched inference on all of those positions at once, as if it were already at that point in the sequence. If the drafted tokens all check out afterwards it jumps ahead; otherwise it throws away everything from the first mismatch onward and carries on from there (rough sketch below).

Not sure about this implementation, but conceptually it only works well on very capable GPUs and for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which is ridiculous.

And if you don't have enough compute, then you get negative speedup from all the extra overhead.
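Rough sketch of the loop I mean (greedy verification for simplicity; the model callables and names are placeholders, not any real inference API):

```python
import numpy as np

def speculative_step(tokens, draft_next_token, target_logits, k=4):
    """One draft-then-verify round, greedy variant (rough sketch).

    draft_next_token(seq) -> next token id from the tiny draft model
    target_logits(seq)    -> per-position logits from the large model,
                             computed for the whole sequence in one
                             batched forward pass
    """
    # 1. Tiny model runs k steps ahead on its own (cheap, but sequential).
    seq = list(tokens)
    for _ in range(k):
        seq.append(draft_next_token(seq))

    # 2. Large model scores every drafted position at once, as if it
    #    were already that far along in the sequence.
    logits = target_logits(seq)

    # 3. Keep drafted tokens while they match what the large model would
    #    have picked anyway; at the first mismatch, take the large
    #    model's token instead and stop.
    out = list(tokens)
    for i in range(k):
        target_tok = int(np.argmax(logits[len(tokens) + i - 1]))
        out.append(target_tok)
        if seq[len(tokens) + i] != target_tok:
            break
    return out
```

Best case you commit k tokens for one big-model pass; worst case you've burned the draft work and the batched verification for a single token, which is where the negative speedup comes from.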