logoalt Hacker News

greenavocadotoday at 4:23 PM4 repliesview on HN

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

Replies

themanualstatestoday at 5:02 PM

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

show 1 reply
nateb2022today at 4:57 PM

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pro's. LM Studio + MLX.

show 1 reply
ridiculous_leketoday at 5:43 PM

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

mattmansertoday at 5:06 PM

That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.

show 2 replies