Hacker News

0xbadcafebee today at 12:46 AM

Most benchmarks show very little difference between a "full quality" model and a quantized lower-bit version. You can shrink a model to a fraction of its "full" size and keep 92-95% of its benchmark performance, with much less VRAM use.
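The VRAM savings follow directly from bits-per-weight. A rough sketch (ignoring KV cache and activation overhead, and using a hypothetical 70B-parameter model as the example):

```python
# Back-of-envelope VRAM needed just to hold a model's weights.
# Ignores KV cache, activations, and framework overhead (simplifying assumptions).

def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate gigabytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 70B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits:2d}-bit: ~{weight_vram_gb(70e9, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```

So going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is what moves large models into consumer-GPU territory.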


Replies

MuffinFlavored today at 1:23 AM

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you're speaking of?
