Hacker News

palmotea yesterday at 2:40 PM

How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16 GB GPU before they were discontinued.


Replies

WithinReason yesterday at 2:49 PM

It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB
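As a rough rule of thumb, the GGUF file has to fit in VRAM with some headroom left for the KV cache and compute buffers. A minimal sketch of that fit check, using the file sizes from the table above; the ~15% overhead margin is a loose assumption (actual overhead depends on context length and batch size):

```python
# File sizes in GB, copied from the quant table above.
QUANTS_GB = {
    "UD-IQ1_M": 10.0, "UD-IQ2_XXS": 10.8, "UD-Q2_K_XL": 12.3,
    "UD-IQ3_XXS": 13.2, "UD-Q3_K_XL": 16.8, "UD-IQ4_XS": 17.7,
    "UD-Q4_K_XL": 22.4, "UD-Q5_K_XL": 26.6, "BF16": 69.4,
}

def fits(file_gb: float, vram_gb: float, margin: float = 0.15) -> bool:
    """True if the model file plus a rough overhead margin fits in VRAM.

    The margin stands in for KV cache and compute buffers; it is an
    assumption, not a measured figure.
    """
    return file_gb * (1 + margin) <= vram_gb

for tag, size_gb in QUANTS_GB.items():
    status = "fits" if fits(size_gb, 16.0) else "needs offload"
    print(f"{tag:12s} {size_gb:5.1f} GB  {status}")
```

By this estimate a 16 GB card comfortably holds the 1- to 3-bit quants fully on-GPU, while Q3_K_XL and above would need partial CPU offload.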
tommy_axle yesterday at 3:34 PM

Pick a decent quant (Q4_K_M to Q6_K), then use llama-fit-params and try it yourself to see if it gives you what you need.

zozbot234 yesterday at 2:49 PM

Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.
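A sketch of what that setup might look like with llama.cpp, assuming a recent build where the `--cpu-moe` flag is available; the model path is a placeholder:

```shell
# Keep MoE expert weights in system RAM and everything else on the GPU.
# mmap is llama.cpp's default, so weights are paged in from disk on demand
# rather than loaded up front.
#   -ngl 99   : offload all non-expert layers to the GPU
#   --cpu-moe : keep expert tensors on the CPU (recent llama.cpp builds)
#   -c 8192   : context length; larger contexts cost more memory
llama-server -m ./model-UD-Q2_K_XL.gguf -ngl 99 --cpu-moe -c 8192
```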

Ladioss yesterday at 3:33 PM

You can run a 25-30B model easily if you use Q3 or Q4 quants and llama-server with a pretty long list of options.
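For illustration, a hedged sketch of such an invocation; flag names and defaults vary between llama.cpp versions, and the model path is a placeholder:

```shell
# Squeeze a Q4 quant onto a 16 GB card:
#   -ngl 99        : offload all layers to the GPU
#   -c 16384       : context length
#   -fa            : flash attention, reduces memory use
#   -ctk/-ctv q8_0 : quantize the KV cache to 8-bit to save VRAM
llama-server -m ./model-Q4_K_M.gguf -ngl 99 -c 16384 -fa -ctk q8_0 -ctv q8_0
```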

trvz yesterday at 2:43 PM

If you have to ask then your GPU is too small.

With 16 GB you'll only be able to run a heavily compressed variant, with noticeable quality loss.
