I'm running llama-swap in a Docker container, with the NVIDIA container toolkit passing the GPU through. llama-swap then runs the right llama-server command to serve whichever model I ask for. I have a folder full of GGUFs that I mount into the container.
You could do the same with plain llama-server, though. I don't use any special flags; I just make sure it's using the GPU. I've found the default fitting to work well.
From memory:
llama-server -m models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -fa on -c 128000
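If you want to script that, here is a rough sketch that wraps the same command and waits for the server to come up. The MODEL, CTX, PORT and RUN variables and the two helper functions are my own placeholders, not part of llama-server; the only things assumed from llama-server itself are its default port 8080, the --port flag, and the /health endpoint, which answers 200 once the model has loaded.

```shell
#!/bin/sh
# Sketch only: variable names and helpers are illustrative assumptions.
MODEL="${MODEL:-models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf}"
CTX="${CTX:-128000}"
PORT="${PORT:-8080}"   # llama-server listens on 8080 unless told otherwise

# Print the command line so it can be checked before anything is started.
server_cmd() {
  echo "llama-server -m $MODEL -fa on -c $CTX --port $PORT"
}

# Poll /health every 2 seconds, up to 60 tries; return 0 once it answers 200.
wait_ready() {
  tries=0
  while [ "$tries" -lt 60 ]; do
    if curl -sf "http://127.0.0.1:$PORT/health" >/dev/null; then
      return 0
    fi
    tries=$((tries + 1))
    sleep 2
  done
  return 1
}

# Default is a dry run that only prints the command; set RUN=1 to start the server.
if [ "${RUN:-0}" = "1" ]; then
  llama-server -m "$MODEL" -fa on -c "$CTX" --port "$PORT" &
  wait_ready && echo "server ready on port $PORT"
else
  server_cmd
fi
```

Run it as-is to see the command it would use, or with RUN=1 to launch it. Watching nvidia-smi while the model loads is an easy way to confirm the GPU is actually being used.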