Have you tried running llama.cpp with Unified Memory Access[1] so your iGPU can seamlessly grab some...

rubiquity • yesterday at 8:39 PM • 2 replies • view on HN

Have you tried running llama.cpp with Unified Memory Access[1] so your iGPU can seamlessly grab some of the RAM? The environment variable is prefixed with CUDA but this is not CUDA specific. It made a pretty significant difference (> 40% tg/s) on my Ryzen 7840U laptop.

1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...

Replies

zozbot234 • yesterday at 8:49 PM

Your link seems to be describing a runtime environment variable, it doesn't need a separate build from source. I'm not sure though (1) why this info is in build.md which should be specific to the building process, rather than some separate documentation; and (2) if this really isn't CUDA-specific, why the canonical GGML variable name isn't GGML_ENABLE_UNIFIED_MEMORY , with the _CUDA_ variant treated as a legacy alias. AIUI, both of these should be addressed with pull requests for llama.cpp and/or the ggml library itself.

➕ show 1 reply

mncharity • yesterday at 10:16 PM

I hadn't tried that, thanks! I found simply defining GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether 1, 0, or "", was a 10x hit to 2 t/s. Perhaps because the laptop's RAM is already so over-committed there. But with the much smaller 4B Qwen3.5-4B-Q8_0.gguf, it doubled performance from 20 to 40+ t/s! Tnx! (an old Quadro RTX 3000 rather than an iGPU)

alt Hacker News

Replies