On Windows or Linux you can run from RAM or split layers between RAM and VRAM; running fully on GPU is faster than either of those, but the limit on what you can run at all isn't VRAM, it's total memory.
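As a back-of-the-envelope sketch of why the split can work for a model that size (a minimal Python estimate; the ~80 layer count, the 2 GB of VRAM headroom, and the even per-layer size are all assumptions, not measurements):

```python
# Rough estimate of a RAM/VRAM layer split for a quantized 70B model.
# All constants below are assumptions for illustration, not measured values.

MODEL_SIZE_GB = 43.0    # quantized weights (the size reported for deepseek-r1:70b)
NUM_LAYERS = 80         # typical layer count for a 70B Llama-style model (assumption)
VRAM_GB = 24.0
RAM_GB = 32.0
VRAM_HEADROOM_GB = 2.0  # guessed reserve for KV cache / CUDA buffers

per_layer_gb = MODEL_SIZE_GB / NUM_LAYERS

# Put as many layers as fit on the GPU; the rest stay in system RAM.
gpu_layers = int((VRAM_GB - VRAM_HEADROOM_GB) / per_layer_gb)
cpu_layers = NUM_LAYERS - gpu_layers
ram_needed_gb = cpu_layers * per_layer_gb

print(f"~{per_layer_gb:.2f} GB per layer")
print(f"offload ~{gpu_layers} layers to the GPU, keep ~{cpu_layers} in RAM")
print(f"the CPU-side layers need ~{ram_needed_gb:.1f} GB of system RAM "
      f"(out of {RAM_GB:.0f} GB)")
```

With those guesses you'd offload roughly half the layers to the 24 GB card and keep about 21 GB of weights in system RAM, which is why a 43 GB model can still start on that machine, just slowly. In llama.cpp the split is set with the --n-gpu-layers option; Ollama estimates it automatically and, as far as I know, lets you override it via its num_gpu parameter.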
So is it possible to load the Ollama deepseek-r1 70b (43 GB) model on my 24 GB VRAM + 32 GB RAM machine?
Does this depend on how I load the model, i.e., with Ollama rather than one of the alternatives? AFAIK, Ollama is basically a llama.cpp wrapper.
I have tried to deploy one myself with OpenWebUI + Ollama, but only for small LLMs. I'm not sure about the bigger ones and worried they might somehow crash my machine. Are there any docs? I'm curious how this works, if there are any.