Most benchmarks show very little improvement of the full-precision ("full quality") model over a quantized lower-bit one. You can shrink the model to a fraction of its full size and keep 92-95% of its performance, with less VRAM use.
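For a rough sense of scale, here's a back-of-the-envelope sketch of weight memory at different bit-widths (the 7B parameter count and the specific bit levels are just illustrative; real usage adds KV cache, activations, and framework overhead on top):

```python
# Rough VRAM estimate for model *weights only* at different quantization levels.
# Actual usage is higher: KV cache, activations, and runtime overhead add on top.
def weight_vram_gib(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

for bits in (16, 8, 5, 4, 2):  # e.g. fp16, int8, ~Q5, ~Q4, ~Q2
    print(f"7B model @ {bits:>2}-bit: ~{weight_vram_gib(7, bits):.1f} GiB weights")
```

So a 7B model drops from roughly 13 GiB of weights at fp16 to around 3.3 GiB at 4-bit, which is the kind of shrinkage I mean.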
> You can shrink the model to a fraction of its full size and keep 92-95% of its performance, with less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you're speaking of?