ntonozzi • today at 6:55 PM • 1 reply

Why do they need to run benchmarks to confirm performance? Can't they run a set of example prompts and verify they get the exact same output token probabilities for each one? The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM.

It is also a bit weird that they are not incorporating speculative decoding. That seems like a critical performance optimization, especially for decode-heavy workloads.
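For what it's worth, the token-probability check suggested above could look roughly like the sketch below. It assumes each engine can return per-position log-probabilities as a list of token -> logprob dicts; that format, the function names, and the numbers are all hypothetical, not taken from either project. It also compares within a tolerance rather than bit-for-bit, since different kernels can round floating-point results differently even when the math is the same.

```python
import math


def max_logprob_gap(ref, cand):
    """Return the largest absolute logprob difference across all positions.

    `ref` and `cand` hold one dict per generated position, mapping
    token -> logprob (hypothetical format; adapt to your engine's output).
    A length mismatch or differing token sets counts as an infinite gap.
    """
    if len(ref) != len(cand):
        return math.inf
    gap = 0.0
    for r, c in zip(ref, cand):
        if r.keys() != c.keys():
            return math.inf
        for tok, lp in r.items():
            gap = max(gap, abs(lp - c[tok]))
    return gap


def outputs_match(ref, cand, atol=1e-3):
    """True if every logprob agrees within `atol` (tolerance is an assumption)."""
    return max_logprob_gap(ref, cand) <= atol


# Made-up numbers, for illustration only.
ref = [{"Hello": -0.10, "Hi": -2.40}, {" world": -0.05, " there": -3.10}]
cand = [{"Hello": -0.1004, "Hi": -2.3997}, {" world": -0.0501, " there": -3.1002}]
print(outputs_match(ref, cand))
```

Treating a differing token set as a hard failure is deliberately strict. In practice the top-k lists can swap entries near ties, so a real check might compare only the tokens both sides report.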

Replies

lukebechtel • today at 7:03 PM

Yes, speculative decoding will make both us and vLLM faster, but we believe it would be a relatively even bump on both sides, so we didn't include it in this comparison. Worth another test!
