Hacker News

0xbadcafebee today at 12:46 AM

Most benchmarks show very little difference between a "full quality" model and a quantized lower-bit version. You can shrink a model to a fraction of its "full" size and keep 92-95% of its benchmark performance, with much less VRAM use.
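The VRAM savings follow directly from bits-per-weight. A rough sketch (ignoring KV cache and activation overhead, and using a hypothetical 70B-parameter model as the example):

```python
# Back-of-envelope VRAM needed just to hold a model's weights.
# Ignores KV cache, activations, and framework overhead (simplifying assumptions).

def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate gigabytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 70B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits:2d}-bit: ~{weight_vram_gb(70e9, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

So going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is what moves large models into consumer-GPU territory.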


Replies

MuffinFlavored today at 1:23 AM

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you're speaking of?
