Even though big, dense models aren't fashionable anymore, they are perfect for specdec, so it can be fun to see the speedup that is possible. I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B with no loss in quality, compared to the model you were benchmarking. Specdec works well for code, so the prompt I used was "Write a React TypeScript demo":
prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
total time = 46592.05 ms / 953 tokens
draft acceptance rate = 0.87616 (757 accepted / 864 generated)
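As a sanity check, the reported figures are internally consistent, and they hint at how much work the draft saved (a back-of-the-envelope sketch; the assumption that each verification pass contributes exactly one main-model token is mine, not something llama.cpp reports):

```python
# Figures copied from the llama.cpp log above.
accepted = 757        # draft tokens the main model accepted
drafted = 864         # draft tokens proposed
generated = 913       # tokens in the final output
eval_ms = 46_278.35   # main-model eval time for those tokens

print(accepted / drafted)            # 0.876 -> the reported acceptance rate
print(generated / (eval_ms / 1000))  # ~19.7 tokens per second, as reported

# If each verification pass yields the accepted draft tokens plus exactly
# one token of the main model's own, the 70B model only ran:
passes = generated - accepted
print(passes, generated / passes)    # ~156 passes, ~5.9 tokens per pass
```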
The draft model cannot affect the quality of the output. A good draft model makes token generation faster and a bad one slows it down, but either way the output comes entirely from the main model: every drafted token is verified, and any token the main model would not have generated itself is rejected and replaced.
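Here is a minimal sketch of why that holds, with toy deterministic functions standing in for the two models (illustrative only, not llama.cpp's implementation): under greedy decoding, a draft token is kept only when the main model would have emitted the same token, so the final sequence is token-for-token identical to running the main model alone.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # greedy model: context -> next token

def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """The draft proposes k tokens; the target keeps only those it agrees with."""
    out = list(prompt)
    end = len(prompt) + n_tokens
    while len(out) < end:
        # Draft model speculates k tokens autoregressively (the cheap part).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # Target verifies: accept the longest prefix it would have produced.
        for t in proposal:
            if len(out) < end and t == target(out):
                out.append(t)          # accepted draft token
            else:
                break
        if len(out) < end:
            out.append(target(out))    # target's own (correction) token
    return out

# Toy stand-ins: the draft agrees with the target most of the time.
target = lambda ctx: sum(ctx) * 31 % 100
draft = lambda ctx: target(ctx) if sum(ctx) % 3 else (target(ctx) + 1) % 100

plain = [1, 2, 3]
for _ in range(20):
    plain.append(target(plain))
assert speculative_decode(target, draft, [1, 2, 3], 20) == plain  # identical
```

The same guarantee carries over to sampling: the accept/reject rule is constructed so that tokens are still drawn from the main model's distribution, which is why only speed, never quality, depends on the draft.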