why is it that speculative decoding lowers quality? My understanding of it is that you use a small/distilled fast model to predict next token - when it doesn't match, you generate more. Checking against the large model is quick.
This should maintain exactly the quality of the original model, no?
I looked it up, and you are correct with regard to the specific algorithm used. In general there are also approximate speculative decoding algorithms, which trade exactness for speed.
Being limited to greedy decoding means it is still not ready for general use, though.
AFAIU it's not that checking against the large model is quick (in the usual P != NP sense of verifying an answer being easier than finding one). It's that you can batch your checks. You speculate the next 5 tokens, and then you can run the large model once, in parallel, over the batch of prefixes [...,n+1], [...,n+2], [...,n+3], [...,n+4], [...,n+5]. If you guessed right for a prefix, you've turned a sequential problem (computing the next token from the current prefix) into a parallel one (scoring multiple prefixes together) that the GPU likes. If you guessed wrong, you throw away the suffix starting at the first wrong guess, and you've wasted some extra energy on the discarded computation.
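A minimal sketch of that accept/reject loop, using toy deterministic "models" (both `target_next` and `draft_next` are hypothetical stand-ins; in a real system the verification list comprehension would be a single batched forward pass on the GPU). With greedy decoding, the output provably matches what the large model alone would produce:

```python
def target_next(prefix):
    # Toy "large model": deterministic greedy next token over a 7-token vocab.
    return (sum(prefix) + 1) % 7

def draft_next(prefix):
    # Toy "draft model": agrees with the target except on every 4th prefix
    # length (an arbitrary way to inject wrong guesses for illustration).
    t = target_next(prefix)
    return (t + 1) % 7 if len(prefix) % 4 == 0 else t

def speculative_step(prefix, k=5):
    # 1. Draft k tokens sequentially with the cheap model.
    spec, guesses = list(prefix), []
    for _ in range(k):
        guesses.append(draft_next(spec))
        spec.append(guesses[-1])
    # 2. Verify: score all k prefixes with the target model. Here this is a
    #    loop; on a GPU these run together as one batched forward pass.
    verified = [target_next(prefix + guesses[:i]) for i in range(k)]
    # 3. Accept the longest matching prefix of guesses; at the first
    #    mismatch, keep the target's own token, so output is exact.
    accepted = []
    for g, v in zip(guesses, verified):
        if g != v:
            accepted.append(v)  # target's correction replaces the bad guess
            break
        accepted.append(g)
    return prefix + accepted  # always advances by at least one token

def greedy_generate(prefix, n):
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out

def speculative_generate(prefix, n, k=5):
    out, goal = list(prefix), len(prefix) + n
    while len(out) < goal:
        out = speculative_step(out, k)
    return out[:goal]  # trim any overshoot from the last accepted batch
```

Since a wrong guess still yields the target's token for that position, each step makes progress, and the final sequence is token-for-token identical to plain greedy decoding with the large model.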