I have a 5070 12 GB laptop GPU and get 72 tokens per second for the first couple thousand tokens, dropping to the mid-to-high 50s after about 15k tokens of context.
This setup is tuned down to the last flag. Changing any parameter above the --temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows so I don't use local models.
# 1,257 tokens 17s 72.18 t/s
$env:CUDA_DEVICE_SCHEDULE = "SPIN"
cd D:\src\llama.cpp\
.\build\bin\Release\llama-server.exe `
--port 8080 `
--host 127.0.0.1 `
-m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
-fitt 2048 `
-c 98304 `
-n 32768 `
-fa on `
-np 1 `
--kv-unified `
-ctk q8_0 `
-ctv q8_0 `
-ctkd q8_0 `
-ctvd q8_0 `
-ctxcp 64 `
--mlock `
--no-warmup `
--spec-type draft-mtp `
--spec-draft-n-max 2 `
--spec-draft-p-min 0.1 `
--chat-template-kwargs '{\"preserve_thinking\": true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0
I get over 100 tok/s sustained on my M4 Max and M5 Max MacBook Pros, using LM Studio + MLX.
Can you comment on its quality and accuracy? People have managed to run Gemma 26b without a GPU on old CPUs, but I don't think the quality is anywhere close to what Gemma 12b offers.
That's a 4-bit quant, which the thread OP specifically called out as rubbish.
That's the Q4_K_XL bit of the filename, for those not in the know.
That’s useless without describing WHY you chose those flags, and how you did the optimisation…
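On the "how you did the optimisation" point, here is a minimal sketch of one way to compare two flag sets, in Python. It is an assumption about method, not what the poster above actually did. The function names are hypothetical, and so are any numbers you feed in. The only figure taken from the thread is 1,257 tokens at 72.18 t/s; the 17.415 s duration is back-calculated from it, because the post rounds the time to 17s.

```python
from statistics import mean, stdev


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation throughput for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def summarize(runs):
    """Mean and sample standard deviation of t/s over (tokens, seconds) runs."""
    rates = [tokens_per_second(t, s) for t, s in runs]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


def compare(baseline, candidate):
    """Compare two configurations, each a list of (tokens, seconds) runs.

    A change is flagged as meaningful only if the difference in means is
    larger than the combined run-to-run spread. This is a crude rule of
    thumb chosen for the sketch, not a proper statistical test.
    """
    b_mean, b_sd = summarize(baseline)
    c_mean, c_sd = summarize(candidate)
    return {
        "baseline_tps": b_mean,
        "candidate_tps": c_mean,
        "delta_pct": (c_mean - b_mean) / b_mean * 100,
        "meaningful": abs(c_mean - b_mean) > (b_sd + c_sd),
    }


if __name__ == "__main__":
    # Figure from the thread: 1,257 tokens at 72.18 t/s (duration back-calculated).
    print(round(tokens_per_second(1257, 17.415), 2))
```

To use it, run each configuration several times with the same prompt and at the same context length, changing one flag at a time. The first post reports throughput falling from 72 t/s to the 50s as context grows, so runs at different context lengths are not comparable.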