I have a 64GB M1 Ultra dedicated to llama.cpp. I get 40 tok/s on a fresh session, decreasing slowly to about 25 tok/s at around 50% of the 256K context, then down to 20 tok/s or less beyond that, though I rarely let it go much higher and hand off instead. This is with Qwen 36B A3B at Q8 without KV quantization. It's not super fast, but perfectly usable for me.
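For reference, a setup along these lines can be sketched with llama.cpp's built-in server; the model filename and port below are placeholders, and the commented-out flags show how KV-cache quantization would be turned on (this setup runs without it):

```shell
# Serve a GGUF model with a 256K context window; the model path is a placeholder.
# On macOS, recent llama.cpp builds offload to Metal by default.
llama-server -m qwen-a3b-q8_0.gguf -c 262144 --port 8080

# KV-cache quantization (which this setup deliberately avoids) would look like:
#   llama-server -m qwen-a3b-q8_0.gguf -c 262144 \
#     --cache-type-k q8_0 --cache-type-v q8_0
```

Quantizing the KV cache roughly halves its memory footprint at long contexts, at some cost in quality, which is why running without it on a 64GB machine is a reasonable trade.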