Even though big, dense models aren't fashionable anymore, they are perfect for speculative decoding (specdec), so it can be fun to see how much of a speedup is possible.
I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B, with no loss in quality compared to the model you were benchmarking:
llama-server \
    --model llama-3.3-70b-instruct-ud-q4_k_xl.gguf \
    --model-draft llama-3.2-1b-instruct-ud-q8_k_xl.gguf \
    --ctx-size 80000 \
    --ctx-size-draft 4096 \
    --draft-min 1 \
    --draft-max 8 \
    --draft-p-min 0.65 \
    -ngl 999 \
    --flash-attn on \
    --parallel 1 \
    --no-mmap \
    --jinja \
    --temp 0.0 \
    -fit off
Specdec works well for code, so the prompt I used was "Write a React TypeScript demo".
prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
total time = 46592.05 ms / 953 tokens
draft acceptance rate = 0.87616 (757 accepted / 864 generated)
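If you want to reproduce the run, here's a minimal client sketch (mine, not part of llama.cpp) that sends the same prompt to the server's OpenAI-compatible chat endpoint, assuming the default 127.0.0.1:8080 address:

import json
import urllib.request

# Ask the running llama-server for a completion; temperature 0 matches
# the --temp 0.0 default set on the server command line above.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Write a React TypeScript demo"}],
        "temperature": 0.0,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])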
The draft model cannot affect the quality of the output: the main model verifies every drafted token and rejects any it would not have produced itself. A good draft model makes token generation faster, and a bad one slows it down, but the output is the same as the main model's either way.
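To make that concrete, here's a rough sketch of the greedy (temperature 0) acceptance rule: the draft proposes a few tokens, the main model checks them and keeps only the prefix it agrees with, supplying its own token at the first disagreement. The toy target_next and draft_next functions below are hypothetical stand-ins for the two models, not anything from llama.cpp.

from typing import Callable, List

Token = str
Model = Callable[[List[Token]], Token]  # greedy: context -> next token

def speculative_step(target: Model, draft: Model,
                     context: List[Token], draft_max: int) -> List[Token]:
    """One round of greedy speculative decoding.

    The draft model proposes up to `draft_max` tokens; the target model
    verifies them and accepts only the prefix it agrees with, then supplies
    its own token at the first disagreement. The result is therefore
    identical to running the target model alone -- only the speed changes.
    """
    # 1. Draft a short continuation cheaply.
    proposed: List[Token] = []
    for _ in range(draft_max):
        proposed.append(draft(context + proposed))

    # 2. Verify with the target model (in llama.cpp this verification is
    #    a single batched forward pass, which is where the speedup comes from).
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft agreed with the target
        else:
            accepted.append(expected)   # reject; use the target's own token
            break
    else:
        # Every draft token was accepted; the target still adds one more.
        accepted.append(target(context + accepted))
    return accepted

# Toy deterministic "models" (hypothetical stand-ins, not real LLMs).
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()

def target_next(ctx: List[Token]) -> Token:
    return TARGET_TEXT[len(ctx)] if len(ctx) < len(TARGET_TEXT) else "<eos>"

def draft_next(ctx: List[Token]) -> Token:
    # Imperfect draft: gets one position wrong, so some drafts get rejected.
    return "red" if len(ctx) == 2 else target_next(ctx)

out: List[Token] = []
while len(out) < len(TARGET_TEXT):
    out.extend(speculative_step(target_next, draft_next, out, draft_max=4))
print(" ".join(out[:len(TARGET_TEXT)]))  # matches the target model exactly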