logoalt Hacker News

zozbot234today at 4:14 PM1 replyview on HN

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.


Replies

greenavocadotoday at 4:23 PM

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0
show 4 replies