No you did not. You got 207 tok/s on an RTX 3090 with speculative decoding which, generally speaking, is not the same quality as serving the model without it.
Greedy-only decoding is even worse. There's a reason every public model ships with suggested sampling parameters. When you don't use them, output tends to degrade severely. In your case, simply running a 14B model on the same hardware with the tools you compare against would probably both be faster and produce higher-quality output.
Why is it that speculative decoding lowers quality? My understanding of it is that you use a small/distilled fast model to predict the next token; when a prediction doesn't match the large model, you fall back to the large model's own token and continue from there. Checking against the large model is quick.
This should maintain exactly the quality of the original model, no?
> speculative decoding which, generally speaking, is not the same quality as serving the model without it.
I've never heard of ANY speculative decoding that wasn't lossless. If it was lossy it'd be called something else.
This page is just a port of DFLASH to gguf format. It only implements greedy decoding, like you said, so the outputs will be inferior, but no worse than greedy decoding on the original model. Though that's just a matter of implementing temperature, top_k, etc.
Speculative decoding doesn't degrade output quality. The distribution it produces is exactly the same if you do it correctly. The original paper on it clearly talks about this. [0]
Speculative decoding is analogous to speculative execution on CPUs. As long as you roll back on an incorrect prediction (i.e. the speculated tokens weren't accepted), everything is mathematically identical. It just exploits more parallelism (specifically, higher arithmetic intensity).
[0] https://arxiv.org/abs/2211.17192
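To make the losslessness argument concrete, here is a minimal sketch of the greedy variant being discussed, using hypothetical toy stand-ins for the draft and target models (the function names and the toy next-token rules are illustrative, not any real library's API). The draft proposes a few tokens, the target verifies them, and any mismatch is rolled back and replaced by the target's own token, so the final sequence is token-for-token identical to plain greedy decoding on the target model:

```python
def target_next(ctx):
    # Stand-in for the large model's greedy argmax: a deterministic toy rule.
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    # Cheap draft model: agrees with the target most of the time,
    # but is deliberately wrong on some tokens.
    t = target_next(ctx)
    return t if t % 7 else (t + 1) % 50

def greedy_decode(prompt, n):
    # Baseline: plain greedy decoding with the target model.
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(target_next(ctx))
    return ctx

def speculative_greedy_decode(prompt, n, k=4):
    ctx = list(prompt)
    while len(ctx) < len(prompt) + n:
        # 1) Draft proposes up to k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ctx + proposal))
        # 2) Target verifies the proposals (a single batched pass in practice).
        accepted = []
        for tok in proposal:
            correct = target_next(ctx + accepted)
            if tok == correct:
                accepted.append(tok)
            else:
                # 3) Mismatch: discard the rest, keep the target's token.
                accepted.append(correct)
                break
        else:
            # All k accepted: the verify pass yields one bonus target token.
            accepted.append(target_next(ctx + accepted))
        ctx.extend(accepted)
    return ctx[: len(prompt) + n]

prompt = [3, 1, 4]
assert speculative_greedy_decode(prompt, 20) == greedy_decode(prompt, 20)
```

Every token appended by the speculative loop is, by construction, exactly the target model's greedy choice for its prefix, which is why the output matches the baseline regardless of how often the draft is wrong; the draft only affects speed. The full sampling case in the paper uses a rejection-sampling acceptance rule instead of exact-match, with the same distributional guarantee.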