
ketchup32613 • today at 2:41 PM • 1 reply • view on HN

You're not wrong about that. Speculative decoding does not affect the quality of tokens generated, as each token has to be verified by the parent model before it is output.

Each token proposed by the draft model has to be verified by the parent (original) model, and the fraction of proposals that pass is the acceptance rate. If that rate falls, the speedup from speculative decoding shrinks and can be eliminated entirely. The acceptance rate, and more directly the speedup from the draft model, is what "performance" refers to in the article.
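To put rough numbers on this, here is a minimal sketch of a commonly used simplified speedup model. It is not from the article. It assumes every draft token is accepted independently with the same probability `alpha`, that the draft model proposes `gamma` tokens per round, and that one draft forward pass costs `cost_ratio` times one parent forward pass. The function name and all three parameters are my own, chosen for illustration.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    Assumptions (illustrative, not from the article):
      alpha      -- probability that a single draft token is accepted (i.i.d.)
      gamma      -- number of draft tokens proposed per verification round
      cost_ratio -- cost of one draft pass relative to one parent pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")

    # Expected tokens emitted per round: the accepted draft prefix plus the
    # one token the parent model always contributes. This is the geometric
    # sum 1 + alpha + alpha^2 + ... + alpha^gamma.
    if alpha == 1.0:
        expected_tokens = float(gamma + 1)
    else:
        expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one round, measured in parent forward passes:
    # gamma draft passes plus one parent verification pass.
    round_cost = gamma * cost_ratio + 1.0

    # Plain decoding emits one token per parent pass, so the ratio of
    # tokens per unit cost is the speedup.
    return expected_tokens / round_cost


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.8, 0.95):
        print(f"alpha={alpha:.2f} speedup={expected_speedup(alpha, 4, 0.1):.2f}x")
```

Under these assumptions the estimate falls as `alpha` falls, and it drops below 1x once the extra draft passes cost more than the accepted tokens save. Real systems differ (acceptance is not independent per token, and batching changes the cost picture), so treat this as a back-of-the-envelope model only.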

Replies

kbumsik • today at 2:51 PM

So the draft model's performance is directly linked to the overall speed. Thank you for the explanation!

By the way, can it be slower than running without speculative decoding in the worst case, then?


alt Hacker News
