
firefly2000 yesterday at 8:46 PM

If the workload were perfectly parallelizable, your claim would be true. However, if it has serial dependency chains, it is absolutely worth computing the result quickly but unreliably and then verifying it in parallel.


Replies

magicalhippo yesterday at 10:37 PM

This is exactly what speculative decoding for LLMs does, and it can yield a nice boost.

A small, hence fast, model predicts the next tokens serially. Then a batch of those tokens is validated by the main model in parallel. If there is a mismatch, you reject the speculated token at that position and all subsequent speculated tokens, take the correct token from the main model, and restart speculation from there.

If the predictions are good and the batch parallelism efficiency is high, you can get a significant boost.
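
To make the loop concrete, here is a minimal sketch of the greedy variant in Python. Everything in it is made up for illustration: `target_model` and `draft_model` are toy functions standing in for the main and small models, `k` is the number of speculated tokens per round, and the "parallel" verification is just a list comprehension where a real system would use one batched forward pass.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: maps a token prefix to the next token (greedy).
Model = Callable[[Sequence[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain serial decoding with the main model, used as the reference."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens serially (cheap per step).
        proposed: List[int] = []
        for _ in range(k):
            proposed.append(draft(out + proposed))

        # 2. The main model gives its own next token at every position.
        #    Simulated with a loop here; a real system batches these.
        verified = [target(out + proposed[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix where draft and main model agree.
        n_ok = 0
        while n_ok < k and proposed[n_ok] == verified[n_ok]:
            n_ok += 1

        # 4. Keep the accepted tokens, then the main model's token at the
        #    first mismatch (or its extra token if everything matched).
        out.extend(proposed[:n_ok])
        out.append(verified[n_ok])
    return out[:goal]


# Toy stand-ins (assumptions, not real models).
def target_model(prefix: Sequence[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11


def draft_model(prefix: Sequence[int]) -> int:
    guess = (prefix[-1] * 3 + 1) % 11
    # Deliberately wrong whenever the prefix length is a multiple of 3.
    return guess if len(prefix) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    spec = speculative_decode(target_model, draft_model, [1], n_new=20, k=4)
    print(spec == greedy_decode(target_model, [1], n_new=20))
```

In this greedy form the output matches what the main model would have produced on its own, so the draft model only affects speed, not results. Each round emits at least one token, which is why a bad draft model costs you throughput but not correctness.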
