Hacker News

trvz · yesterday at 2:43 PM

If you have to ask then your GPU is too small.

With 16 GB you'll only be able to run a very compressed variant with noticeable quality loss.


Replies

coder543 · yesterday at 3:17 PM

Not true. With a MoE, you can offload quite a bit of the model to the CPU without losing a ton of performance. 16 GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
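A rough sketch of why this offload trick works: in a MoE model, most of the parameters sit in the expert FFNs, which --n-cpu-moe keeps in system RAM, while the much smaller attention/shared weights stay on the GPU. Every number below is hypothetical (the thread doesn't name the model's actual size or expert fraction); this is illustration only.

```python
# Sketch: GPU memory for weights alone (no KV cache), with and without
# keeping the MoE expert weights on the CPU. Numbers are hypothetical.

def vram_needed_gb(total_params_b, expert_frac, bits_per_weight, moe_on_cpu):
    """Approximate GPU memory in GB for the model weights."""
    total_gb = total_params_b * bits_per_weight / 8  # 1B params @ 8 bits = 1 GB
    gpu_frac = (1.0 - expert_frac) if moe_on_cpu else 1.0
    return total_gb * gpu_frac

# Hypothetical 30B-total MoE with ~85% of weights in the experts,
# quantized to ~4.5 bits/weight (roughly a Q4_K_M-class quant):
full = vram_needed_gb(30, 0.85, 4.5, moe_on_cpu=False)
split = vram_needed_gb(30, 0.85, 4.5, moe_on_cpu=True)
print(f"all on GPU: {full:.1f} GB, experts on CPU: {split:.1f} GB")
```

The GPU-resident slice ends up a small fraction of the total, which is why a 16 GB card can host the hot path while the experts stream from RAM.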

palmotea · yesterday at 2:46 PM

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
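A quick back-of-the-envelope for this question: quantized weight size is roughly parameters times bits-per-weight divided by 8, before KV cache and runtime overhead. The model sizes below are generic examples, not the model under discussion.

```python
# Approximate weight footprint of a quantized model, ignoring KV cache
# and overhead. Sizes chosen for illustration only.

def model_gb(params_b, bits):
    """GB of weights for params_b billion parameters at `bits` per weight."""
    return params_b * bits / 8

for p in (7, 14, 32, 70):
    print(f"{p}B @ 4-bit ~= {model_gb(p, 4):.1f} GB of weights")
```

By this rough math, a 4-bit 32B dense model already saturates a 16 GB card, which is where unified memory (or MoE CPU offload) starts to matter.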

FusionX · yesterday at 2:59 PM

Aren't 4-bit models decent? Since this is an MoE model, I'm assuming it should have respectable tk/s, similar to previous MoE models.

gunalx · yesterday at 6:31 PM

Running q3 xss with either full or quantized context on a 16 GB GPU: quality is still pretty decent, and it fits fine with up to 64k context.
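Quantizing the context matters at 64k because KV-cache size grows linearly with context length. A sketch of the standard formula, using placeholder architecture numbers (layers, KV heads, head dim are hypothetical, not the actual model's):

```python
# KV cache = 2 tensors (K and V) per layer, each of shape
# (n_kv_heads * head_dim) per token, times context length.
# Architecture numbers below are placeholders.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

fp16 = kv_cache_gb(48, 8, 128, 65536, 2)  # fp16 cache: 2 bytes/element
q8   = kv_cache_gb(48, 8, 128, 65536, 1)  # ~8-bit cache: 1 byte/element
print(f"64k KV cache: fp16 ~= {fp16:.1f} GB, 8-bit ~= {q8:.1f} GB")
```

Halving the cache element size frees several GB at long context, which is often the difference between fitting on a 16 GB card or not.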