If you have to ask then your GPU is too small.
With 16 GB you'll only be able to run a heavily compressed variant, with noticeable quality loss.
> If you have to ask then your GPU is too small.
What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
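As a rough way to reason about the minimum: the weights dominate, and a quant at b bits per weight needs about params × b / 8 bytes, plus overhead for KV cache and buffers. A hedged back-of-envelope sketch (the 30B parameter count below is just an illustrative example, not from the thread):

```python
# Back-of-envelope VRAM estimate for a quantized model.
# Assumption: weights dominate; KV cache and runtime buffers add more on top.

def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A hypothetical 30B-parameter model:
print(f"{approx_weight_gb(30, 4):.1f} GB at 4-bit")  # 15.0 GB
print(f"{approx_weight_gb(30, 3):.1f} GB at 3-bit")  # 11.2 GB
```

So for a mid-size model, 16 GB of VRAM is right at the edge for 4-bit before you even account for context, which is why offloading (or unified memory) comes up.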
Aren't 4-bit models decent? Since this is an MoE model, I'm assuming it should have respectable tk/s, similar to previous MoE models.
I'm running a Q3 XXS quant, with full or quantized context as options, on a 16 GB GPU; quality is still pretty decent and it fits fine with up to 64k context.
Not true. With an MoE, you can offload quite a bit of the model to the CPU without losing much performance. 16GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
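For anyone who wants the concrete shape of that, here's a minimal llama-server invocation sketch. The model filename and the layer count passed to --n-cpu-moe are placeholders, not from the thread; the idea is to push everything to the GPU and then move expert tensors back to system RAM until the rest fits in VRAM:

```shell
# Sketch of CPU MoE offload with llama-server (llama.cpp).
#   --n-gpu-layers 99  -> try to offload all layers to the GPU
#   --n-cpu-moe 20     -> keep the MoE expert tensors of the first 20
#                         layers in system RAM; raise this number until
#                         the remainder fits in your 16 GB of VRAM
#   -c 32768           -> context size in tokens
llama-server \
  -m ./model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20 \
  -c 32768
```

The reason this works well for MoE specifically is that only a few experts are active per token, so the expert weights sitting in slower system RAM are touched sparsely while the dense attention layers stay on the GPU.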