> What's the tok/s you get these days?
I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):
% llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 134217.73 MB
| model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 | 189.67 ± 1.98 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 | 19.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d10000 | 168.92 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d10000 | 18.93 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d20000 | 152.42 ± 0.22 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d20000 | 17.87 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d30000 | 139.37 ± 0.28 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d30000 | 17.12 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d40000 | 128.38 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d40000 | 16.38 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d50000 | 118.07 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d50000 | 15.66 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d60000 | 108.44 ± 0.38 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d60000 | 14.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d70000 | 98.85 ± 0.18 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d70000 | 14.36 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d80000 | 91.39 ± 0.49 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d80000 | 13.84 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d90000 | 85.76 ± 0.24 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d90000 | 13.30 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d100000 | 80.19 ± 0.83 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d100000 | 12.82 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d150000 | 54.46 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d150000 | 10.17 ± 0.09 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d200000 | 47.05 ± 0.15 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d200000 | 9.04 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d250000 | 40.71 ± 0.26 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d250000 | 8.01 ± 0.02 |
build: d28961d81 (8299)
So it starts at 20 tps tg (token generation) and 190 tps pp (prompt processing) with an empty context, and ends at 8 tps tg and 40 tps pp with a 250k prefill. I suspect there are still a lot of optimizations to be implemented for Qwen 3.5 in llama.cpp; I wouldn't be surprised if it reaches 25 tps in a few months.
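For anyone who wants the slowdown as ratios rather than raw numbers, here is a rough sketch that computes how much of the empty-context throughput survives at each depth. The figures are copied from a subset of the table rows above; the names `ROWS`, `retained`, and `seconds_for` are just illustrative, not anything from llama.cpp.

```python
# (depth in tokens, pp512 t/s, tg128 t/s), copied from the llama-bench table above
ROWS = [
    (0, 189.67, 19.98),
    (50_000, 118.07, 15.66),
    (100_000, 80.19, 12.82),
    (150_000, 54.46, 10.17),
    (200_000, 47.05, 9.04),
    (250_000, 40.71, 8.01),
]


def retained(rows):
    """Return (depth, pp fraction, tg fraction) relative to the first (empty-context) row."""
    _, pp0, tg0 = rows[0]
    return [(depth, pp / pp0, tg / tg0) for depth, pp, tg in rows]


def seconds_for(n_tokens, tokens_per_second):
    """Wall-clock seconds to produce n_tokens at a constant rate (a simplification)."""
    return n_tokens / tokens_per_second


for depth, pp_frac, tg_frac in retained(ROWS):
    print(f"d={depth:>7}: pp {pp_frac:6.1%}  tg {tg_frac:6.1%}")
```

By this measure, at 250k depth prompt processing keeps roughly a fifth of its empty-context speed and token generation roughly two fifths. `seconds_for(1000, 8.01)` gives a ballpark for a 1000-token reply at that depth, assuming the rate stays constant over the reply, which it will not exactly.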
> You're the guy who launched Neovim!
That's me ;D
> I use it every day.
So have I, for the past 12 years! Though I admit that in the past year I have greatly reduced the amount of code I write by hand :/
Thank you for Neovim! I also use it every day, though these days mostly for thinking, text, and Markdown.
Have you compared against MLX? Sometimes I get much faster responses, but it feels like the quality is worse (e.g. tool calls not working).
That's surprisingly fast. Thanks for sharing.
Apologies to others for the off-topic comment, but thank you so much for Neovim. I started using Vim 25 years ago and I almost don't know how to type without a proper Vi-based editor. I don't write as much code these days, but I write other stuff (which definitely needs to be mostly handwritten) in Neovim, and I feel so grateful that this tool is still receiving love and getting new updates.