sheeshkebab • today at 12:25 AM • 2 replies

You can only run heavily quantized models on RTX 30/40/50-series GPUs (32GB of VRAM or less), and you probably want MoE versions like Qwen 35b for this to run at a speed somewhat comparable to Claude. It's still not there, to be honest, but it's getting there. Personally, I mess around with llama.cpp on an M5 Max with 128GB. It's a decent setup for trying various medium-sized things, and it runs LLMs surprisingly well without quantization, at least the MoE models.

Replies

akulbe • today at 4:55 AM

How is that machine for local inference? It's a serious consideration for me, but getting to hear more from folks that already have it would be helpful.

SwellJoe • today at 12:37 AM

Two 3090s is 48GB, so it's possible to run the 6-bit quantization comfortably, which is fine. It doesn't start to get notably dumber until lower than that. It won't be as fast as a hosted model, but dual 3090s will be comfortably fast for interactive use with the MoE version and not terrible to use with the dense model. I run the dense model at 8 bits on my dual Radeon V620 desktop machine, which I think would be slower than two 3090s, or at least not notably faster.
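
For anyone who wants to check the VRAM arithmetic, here is a rough back-of-the-envelope sketch. It assumes a 35B-parameter model (taken from the "Qwen 35b" mentioned in the parent comment) and counts only the weights; KV cache, activations, and runtime overhead are ignored, and real quantized files usually come out somewhat larger than the nominal bit width suggests. The function name `weight_gib` is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at exactly bits_per_weight bits.
    Real quantization formats add per-block scales, so treat this as a floor.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    vram_gib = 48  # two 24 GiB cards, as in the dual-3090 setup above
    for bits in (16, 8, 6, 4):
        size = weight_gib(35, bits)
        verdict = "fits" if size < vram_gib else "does not fit"
        print(f"{bits:>2}-bit: {size:5.1f} GiB of weights, {verdict} in {vram_gib} GiB")
```

Whatever is left over after the weights is what you have for context (KV cache), so a quantization that only barely fits will not be comfortable in practice.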
