logoalt Hacker News

MuffinFlavoredtoday at 1:23 AM1 replyview on HN

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?


Replies

bigyabaitoday at 1:33 AM

> Are there a lot of options how "how far" do you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.

show 1 reply