> speculative decoding which, generally speaking, is not the same quality as serving the model without it.
I've never heard of ANY speculative decoding that wasn't lossless. If it were lossy, it'd be called something else.
This page is just a port of DFLASH to GGUF format. It only implements greedy decoding, like you said, so the outputs will be inferior to what you'd get with sampling, but not inferior to greedy decoding on the original model. Though fixing that is just a matter of implementing temperature, top_k, etc.
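
To make the "lossless" point concrete, here's a minimal toy sketch of greedy speculative decoding. It is not the DFLASH or llama.cpp code; `toy_target`, `toy_draft`, and both decode functions are hypothetical names I made up, and the "models" are just functions from a token prefix to the next greedy token. The point is that every token that ends up in the output is the target model's own greedy pick, so the draft can only affect speed, never the result.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_greedy_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # The draft model cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # The target checks each proposed position. A real implementation
        # scores all positions in one batched forward pass; this toy loops.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
    return seq


def toy_target(seq: List[int]) -> int:
    # Arbitrary deterministic rule standing in for argmax over logits.
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target most of the time, disagrees on some prefixes.
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    plain = greedy_decode(toy_target, prompt, 20)
    spec = speculative_greedy_decode(toy_target, toy_draft, prompt, 20)
    print(plain == spec)
```

With sampling (temperature, top_k, etc.) the exact-match check gets replaced by an accept/reject step on the two models' probabilities, which is the part this port hasn't implemented yet.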