If you have to ask then your GPU is too small.
With 16 GB you'll only be able to run a heavily compressed variant, with noticeable quality loss.
> If you have to ask then your GPU is too small.
What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
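As a rough way to reason about the minimum: the weights dominate, and a quant at b bits per weight needs about params × b / 8 bytes, plus overhead for KV cache and buffers. A hedged back-of-envelope sketch (the 30B parameter count below is just an illustrative example, not from the thread):

```python
# Back-of-envelope VRAM estimate for a quantized model.
# Assumption: weights dominate; KV cache and runtime buffers add more on top.

def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A hypothetical 30B-parameter model:
print(f"{approx_weight_gb(30, 4):.1f} GB at 4-bit")  # 15.0 GB
print(f"{approx_weight_gb(30, 3):.1f} GB at 3-bit")  # 11.2 GB
```

So for a mid-size model, 16 GB of VRAM is right at the edge for 4-bit before you even account for context, which is why offloading (or unified memory) comes up.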
Aren't 4-bit models decent? Since this is an MoE model, I'm assuming it should have respectable tk/s, similar to previous MoE models.
I'm running a Q3 XXS quant, with full or quantized context as options, on a 16 GB GPU; quality is still pretty decent and it fits fine with up to 64k context.
Not true. With an MoE, you can offload quite a bit of the model to the CPU without losing much performance. 16GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
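For anyone who wants the concrete shape of that, here's a minimal llama-server invocation sketch. The model filename and the layer count passed to --n-cpu-moe are placeholders, not from the thread; the idea is to push everything to the GPU and then move expert tensors back to system RAM until the rest fits in VRAM:

```shell
# Sketch of CPU MoE offload with llama-server (llama.cpp).
#   --n-gpu-layers 99  -> try to offload all layers to the GPU
#   --n-cpu-moe 20     -> keep the MoE expert tensors of the first 20
#                         layers in system RAM; raise this number until
#                         the remainder fits in your 16 GB of VRAM
#   -c 32768           -> context size in tokens
llama-server \
  -m ./model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20 \
  -c 32768
```

The reason this works well for MoE specifically is that only a few experts are active per token, so the expert weights sitting in slower system RAM are touched sparsely while the dense attention layers stay on the GPU.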