mncharity • yesterday at 7:37 PM

Fwiw, with its predecessor's Qwen3.5-35B-A3B-Q6_K.gguf, on a laptop's 6 GB VRAM and 32 GB RAM, with default llama.cpp settings, I get 20 t/s generation.

Replies

rubiquity • yesterday at 8:39 PM

Have you tried running llama.cpp with Unified Memory[1] enabled, so your iGPU can seamlessly grab some of the system RAM? The environment variable has CUDA in its name, but it is not CUDA-specific. It made a pretty significant difference (over 40% faster token generation) on my Ryzen 7840U laptop.

[1] https://github.com/ggml-org/llama.cpp/blob/master/docs/build...
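
For anyone wanting to try this, here is a minimal sketch of launching llama.cpp with the variable set. It assumes the variable meant is `GGML_CUDA_ENABLE_UNIFIED_MEMORY` (the comment doesn't name it, so check the linked build docs), that `llama-cli` is on your PATH, and that the model path is a placeholder. The helper names are made up for illustration.

```python
import os
import subprocess

# Assumption: this is the variable the comment refers to; verify against the build docs.
UNIFIED_MEMORY_VAR = "GGML_CUDA_ENABLE_UNIFIED_MEMORY"


def unified_memory_env(base=None):
    """Return a copy of the environment with unified memory switched on."""
    env = dict(os.environ if base is None else base)
    env[UNIFIED_MEMORY_VAR] = "1"
    return env


def build_command(model_path, prompt="Hello", n_tokens=64, binary="llama-cli"):
    """Build a llama.cpp command line; binary name and model path are placeholders."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_tokens)]


def run(model_path):
    """Launch llama.cpp with the variable set (needs llama-cli installed)."""
    return subprocess.run(
        build_command(model_path),
        env=unified_memory_env(),
        check=True,
    )
```

Calling `run("Qwen3.5-35B-A3B-Q6_K.gguf")` once with the variable set and once with a plain environment, then comparing the reported t/s, should show whether it helps on a given machine. The gain quoted above is one laptop's result, so treat it as a data point and not a guarantee.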
