Hacker News

ketchup32613, today at 2:41 PM

You're not wrong about that. Speculative decoding does not affect the quality of tokens generated, as each token has to be verified by the parent model before it is output.

The catch is the acceptance rate, i.e. the fraction of draft-model tokens that the parent/original model accepts during verification. If that rate falls, the speedup from speculative decoding shrinks and can disappear entirely. This acceptance rate, and more directly the resulting speedup from the draft model, is what "performance" refers to in the article.
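
To put rough numbers on that, here is a minimal sketch under a commonly used simplified model: each draft token is accepted independently with probability `alpha`, the draft model proposes `gamma` tokens per step, and one draft forward pass costs `cost_ratio` times one parent-model pass. All three names are illustrative assumptions of mine, not values from the article, and real acceptance is not independent per token.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under a simplified model.

    alpha      -- assumed per-token acceptance probability (0..1)
    gamma      -- assumed number of draft tokens proposed per step
    cost_ratio -- assumed cost of one draft pass relative to one parent pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 1 or cost_ratio < 0:
        raise ValueError("gamma must be >= 1 and cost_ratio >= 0")

    # Expected tokens emitted per step: accepted draft tokens plus the one
    # token the parent model always contributes (geometric series).
    if alpha == 1.0:
        tokens_per_step = gamma + 1
    else:
        tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)

    # Cost per step in units of one parent pass: gamma draft passes
    # plus a single parent pass that verifies all of them.
    cost_per_step = gamma * cost_ratio + 1

    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    for alpha in (0.9, 0.6, 0.3, 0.0):
        print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.1), 2))
```

In this model the result drops below 1.0 once acceptance is low enough, because the draft passes are paid for on every step whether or not their tokens are kept.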


Replies

kbumsik, today at 2:51 PM

So the draft model's performance is directly linked to the overall speed. Thank you for the explanation!

By the way, can it then be slower than running without speculative decoding in the worst case?
