logoalt Hacker News

jboss10today at 12:18 AM0 repliesview on HN

I'm running llama-swap in a docker container with nvidia container utis to pass through the GPU. This then runs the correct llama-server command to provide the model I want. I have a folder full of guff s I mount in the container.

But this could be done with just llama-server normally. I don't use any special command, just ensure that it's using the GPU. I've found the default fitting to be good.

From memory:

llama-server -m models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -fa on -c 128000