Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).
System info:
$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7897 (3dd95914d)
built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line: $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
--ctx-size 32768

Super cool! Also, with `--fit on` you technically no longer need `--ctx-size 32768` - llama-server will auto-determine the max context size!
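For reference, the trimmed invocation would just be the command above with `--ctx-size` dropped (a sketch based on the reply's claim that `--fit on` auto-determines context size; all other flags unchanged):

```shell
# Same invocation as above, minus --ctx-size:
# with --fit on, llama-server picks the max context that fits in VRAM.
./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on
```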
What am I missing here? I thought this model needs ~46GB of unified memory for the 4-bit quant, and the Radeon RX 7900 XTX only has 24GB, right? Hoping to get some insight - thanks in advance!