The token is correct if it matches the one generated by the main model. It works like this:
The draft model quickly generates draft-token 1.
The main model then starts working on two tokens in parallel. It calculates token 1 based on the context, and token 2 based on the context + draft-token 1.
Once the two tokens have been generated, you can check whether the draft-token 1 from the draft model matches token 1 from the main model.
If they match, you have just calculated two tokens in the time it takes to generate one, because the calculation was done in parallel. If they do not match, discard token 2 and regenerate it: since the main model has already produced the correct token 1, you regenerate token 2 from the context + token 1 (from the main model). This costs extra time, but the output is always identical to what the main model alone would have produced.
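The steps above can be sketched roughly like this. `draft_model` and `main_model` are hypothetical stand-ins, each mapping a token sequence to its next token; in reality the main model would score both positions in a single batched forward pass rather than two calls.

```python
# Sketch of one speculative-decoding step with a single draft token,
# assuming greedy decoding on both models.

def speculative_step(context, draft_model, main_model):
    # The draft model cheaply proposes the next token.
    draft_token_1 = draft_model(context)

    # The main model works on both positions "in parallel":
    # token 1 from the context, token 2 from context + draft-token 1.
    token_1 = main_model(context)
    token_2 = main_model(context + [draft_token_1])

    if draft_token_1 == token_1:
        # Match: two main-model tokens for the price of one step.
        return context + [token_1, token_2]

    # Mismatch: discard token 2 and regenerate it from the verified token 1.
    token_2 = main_model(context + [token_1])
    return context + [token_1, token_2]
```

Either way the returned sequence is exactly what the main model alone would have produced; the draft model only changes how fast you get there.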
Models do not generate tokens. They generate probabilities for each token.
Inference parameters (temperature, top-k, and so on) then select a token from that distribution.
You can always pick the most likely token (greedy decoding), or you can sample probabilistically.
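Here is a minimal sketch of those two selection strategies, using only the standard library. `probs` is a hypothetical mapping from token to probability; real inference works on logits over the whole vocabulary, but the idea is the same.

```python
import random

def pick_token(probs, greedy=True, rng=random):
    """Select a token from a {token: probability} distribution."""
    if greedy:
        # Greedy decoding: always take the single most likely token.
        return max(probs, key=probs.get)
    # Probabilistic decoding: sample in proportion to the probabilities.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With `greedy=True` the same input always yields the same token; with `greedy=False` the output varies run to run, which is exactly why the draft and main models can disagree even when they broadly agree on the distribution.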
How you do that in both the draft model and the main model changes how likely it is that you get the exact same tokens. You can then choose to accept a draft token only if it matches exactly, or to accept it as long as it was reasonably likely to be chosen.
Let's say the main model picked the 2nd most likely token and the draft model picked the most likely. You can reject that, but you get less speed-up. You can accept it and get more speed-up, but you do change the output: you risk the distribution of your outputs not being what you hope.
I am simplifying. I know that in https://arxiv.org/pdf/2302.01318 they specify a probability of rejecting a token.
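My rough paraphrase of that paper's acceptance rule, not their reference code: accept the draft token x with probability min(1, p(x)/q(x)), where p is the main model's distribution and q is the draft model's, and on rejection resample from the normalized residual max(0, p - q). That construction keeps the overall output distribution identical to sampling from p directly.

```python
import random

def accept_or_resample(draft_token, p, q, rng=random):
    """p and q map token -> probability for the main and draft models."""
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Note that when the main model likes the draft token at least as much as the draft model did (p(x) >= q(x)), it is always accepted, even if it was not either model's top choice.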