logoalt Hacker News

TheCycoONElast Monday at 8:32 PM2 repliesview on HN

I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?


Replies

jboss10today at 12:18 AM

I'm running llama-swap in a docker container with nvidia container utis to pass through the GPU. This then runs the correct llama-server command to provide the model I want. I have a folder full of guff s I mount in the container.

But this could be done with just llama-server normally. I don't use any special command, just ensure that it's using the GPU. I've found the default fitting to be good.

From memory:

llama-server -m models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -fa on -c 128000

slimlast Monday at 9:03 PM

use llama.cpp with cuda

show 1 reply