Hacker News

btown · yesterday at 12:39 PM · 2 replies

For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?


Replies

Zetaphor · yesterday at 1:21 PM

My understanding as well is that speculative decoding only works with a smaller draft model that shares the target model's tokenizer — typically a quant or a smaller member of the same model family. You're using the faster sampling of the smaller model's approximation of the larger model's output distribution to cheaply predict its next tokens. This wouldn't work cross-model, since the vocabularies and token probabilities are completely different.
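To make the "same vocabulary" requirement concrete, here's a minimal sketch of the standard speculative-decoding accept/reject step. The function name and the toy probability numbers are illustrative, not from any particular implementation; the key point is that the draft and target distributions are indexed by the same token ids, which is exactly what breaks if the two models use different tokenizers.

```python
import random

def accept_draft_token(token, p_target, q_draft, rng=random.random):
    """Standard speculative-decoding acceptance test: keep the draft
    model's proposed token with probability min(1, p/q), where p and q
    are the target and draft probabilities for that SAME token id.
    (On rejection, the target model resamples; omitted here.)"""
    p, q = p_target[token], q_draft[token]
    return rng() < min(1.0, p / q)

# Toy distributions over a shared 3-token vocabulary (made-up numbers).
p_target = {0: 0.7, 1: 0.2, 2: 0.1}  # big model's next-token distribution
q_draft  = {0: 0.6, 1: 0.3, 2: 0.1}  # draft model's distribution

# Token 0: the target assigns MORE mass than the draft (0.7 > 0.6),
# so min(1, p/q) = 1 and the draft token is always accepted.
print(accept_draft_token(0, p_target, q_draft, rng=lambda: 0.99))  # True
```

If the draft model tokenized text differently, token id 0 would mean different strings in the two distributions and the ratio p/q would be meaningless — hence the interest in retokenization/bridging schemes when tokenizers don't match.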

vessenes · yesterday at 1:35 PM

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.