Getting ~36-33 tok/s, depending on prompt length (see the "S_TG t/s" column), on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:
$ llama-server --version
version: 8851 (e365e658f)
$ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.529 | 654.11 | 3.470 | 36.89 | 4.999 | 225.67 |
| 2000 | 128 | 1 | 2128 | 3.064 | 652.75 | 3.498 | 36.59 | 6.562 | 324.30 |
| 4000 | 128 | 1 | 4128 | 6.180 | 647.29 | 3.535 | 36.21 | 9.715 | 424.92 |
| 8000 | 128 | 1 | 8128 | 12.477 | 641.16 | 3.582 | 35.73 | 16.059 | 506.12 |
| 16000 | 128 | 1 | 16128 | 25.849 | 618.98 | 3.667 | 34.91 | 29.516 | 546.42 |
| 32000 | 128 | 1 | 32128 | 57.201 | 559.43 | 3.825 | 33.47 | 61.026 | 526.47 |

Getting ~25-26 tok/s with the ROCm backend on the same card (llama.cpp b8884):
$ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.034 | 966.90 | 4.851 | 26.39 | 5.885 | 191.67 |
| 2000 | 128 | 1 | 2128 | 2.104 | 950.38 | 4.853 | 26.38 | 6.957 | 305.86 |
| 4000 | 128 | 1 | 4128 | 4.269 | 937.00 | 4.876 | 26.25 | 9.145 | 451.40 |
| 8000 | 128 | 1 | 8128 | 8.962 | 892.69 | 4.912 | 26.06 | 13.873 | 585.88 |
| 16000 | 128 | 1 | 16128 | 19.673 | 813.31 | 4.996 | 25.62 | 24.669 | 653.78 |
| 32000 | 128 | 1 | 32128 | 46.304 | 691.09 | 5.122 | 24.99 | 51.426 | 624.75 |
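Side note on `-dev ROCm1`: it pins the run to one specific ROCm device by name. If you're unsure which device names your build exposes, recent llama.cpp builds can print them and exit; a minimal check, assuming your build has the `--list-devices` flag:
$ llama-server --list-devices   # prints the device names accepted by -dev/--device, then exits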
Getting ~44-40 tok/s on a 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench invocation).
Edit: the model gets stuck in infinite loops at this quantization level (IQ4_XS). I've also tried the Q5_K_M quantization (it fits with up to a 51968-token context on the 24GB card), which seems more robust.
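As a rough sanity check on why Q5_K_M still fits in 24GB: treating the model as ~27B weights (per the model name) at roughly 4.25 bits/weight for IQ4_XS and roughly 5.7 bits/weight for Q5_K_M (both approximate averages, not exact file sizes), the weights alone come to about:
$ python3 -c 'print(27e9*4.25/8/2**30, 27e9*5.7/8/2**30)'   # ~13.4 GiB (IQ4_XS) vs ~17.9 GiB (Q5_K_M)
That leaves a few GiB of headroom for the KV cache and compute buffers, which is roughly consistent with the ~52K context limit at Q5_K_M.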