TheCycoONE • last Monday at 8:32 PM • 2 replies

I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?

Replies

jboss10 • today at 12:18 AM

I'm running llama-swap in a Docker container, with the NVIDIA container utils passing the GPU through. llama-swap then runs the right llama-server command for whichever model I ask for. I have a folder full of GGUFs that I mount into the container.

But you could do the same with plain llama-server. I don't use any special flags; I just make sure it's using the GPU. I've found the default fitting to be good.

From memory:

llama-server -m models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -fa on -c 128000
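If you want to confirm the server came up and the GPU is actually in use, something like the following should work. It is only a rough sketch: it assumes llama-server is listening on its default address (127.0.0.1:8080) and that curl and nvidia-smi are installed. The SERVER_HOST and SERVER_PORT variables and the function names are made up for this example, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Assumption: llama-server was started with its default host and port.
# SERVER_HOST / SERVER_PORT are hypothetical names used only in this sketch.
SERVER_HOST="${SERVER_HOST:-127.0.0.1}"
SERVER_PORT="${SERVER_PORT:-8080}"

# Print the base URL of the server.
server_url() {
  printf 'http://%s:%s' "$SERVER_HOST" "$SERVER_PORT"
}

# Poll the /health endpoint once per second until it answers successfully.
# First argument: maximum number of attempts (default 120).
# Returns 0 when the server is ready, 1 if it never became ready.
wait_for_server() {
  local attempts="${1:-120}"
  local i
  for i in $(seq 1 "$attempts"); do
    if curl -fsS "$(server_url)/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Show GPU name and memory use, to see how much of the model landed in VRAM.
show_vram() {
  nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
}
```

After starting llama-server in another terminal, run `wait_for_server && show_vram`. If memory.used stays near zero after the model has loaded, the server is probably running on the CPU only.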

slim • last Monday at 9:03 PM

Use llama.cpp with CUDA.
