
netdur today at 12:56 PM

Hmm... at Q4_K_M, stock-style quantization retains ~99–99.8% of BF16 accuracy, while AutoRound pushes that to ~99.4–100%+ (above 100% in some cases??). The gap is roughly 0.1–0.7 percentage points.

https://github.com/intel/auto-round/blob/main/docs/gguf_alg_...
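To make the "percent of BF16 accuracy retained" framing concrete, here's a minimal sketch of how that relative number is usually computed (the scores below are made up for illustration, not taken from the linked doc):

    # Hypothetical benchmark scores; the real numbers live in the linked Intel doc.
    bf16_score = 0.712       # baseline accuracy of the BF16 model
    q4_km_score = 0.705      # same benchmark, Q4_K_M quant
    autoround_score = 0.710  # same benchmark, AutoRound 4-bit quant

    def retention(quant: float, baseline: float) -> float:
        """Percent of the baseline accuracy the quantized model keeps."""
        return 100.0 * quant / baseline

    print(f"Q4_K_M:    {retention(q4_km_score, bf16_score):.1f}% of BF16")
    print(f"AutoRound: {retention(autoround_score, bf16_score):.1f}% of BF16")
    # With this definition a quant can land above 100% by noise on a given
    # benchmark, which is how ">100% retention" figures show up in such tables.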


Replies

rhdunn today at 4:09 PM

My experience is that at Q5 and lower you start to see noticeable degradation in performance/quality. It's especially noticeable at Q4, where models easily get trapped in repeating token loops. I generally use Q6.

[1] https://medium.com/@paul.ilvez/demystifying-llm-quantization...
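That repeating-loop failure is easy to spot mechanically; a hypothetical check (my own sketch, not from the linked article) that scans generated token IDs for a short cycle repeated at the tail might look like:

    def has_repeat_loop(token_ids, max_period=8, min_repeats=4):
        """Return True if the tail of the sequence is one short cycle repeated
        at least min_repeats times -- the classic 'stuck in a loop' symptom."""
        for period in range(1, max_period + 1):
            needed = period * min_repeats
            if len(token_ids) < needed:
                continue
            tail = token_ids[-needed:]
            cycle = tail[:period]
            if all(tail[i] == cycle[i % period] for i in range(needed)):
                return True
        return False

    print(has_repeat_loop([17, 42] * 4))              # True: 2-token loop
    print(has_repeat_loop([5, 9, 13, 2, 88, 41, 7]))  # False: no repetition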

woadwarrior01 today at 3:54 PM

> at Q4_K_M, stock-style quantization is retaining ~99–99.8% of BF16 accuracy

That's a tall claim. By that measure, even NVIDIA's QAD, which AFAIK is currently SOTA for 4-bit quantization (albeit NVFP4 instead of INT4), would be worse than Q4_K_M RTN quantization. :D

https://arxiv.org/abs/2601.20088
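For context, "RTN" here is plain round-to-nearest quantization with no calibration pass. A minimal per-group asymmetric INT4 RTN sketch (my own illustration; llama.cpp's actual Q4_K_M layout adds super-block scales and mins on top of this):

    import numpy as np

    def rtn_int4(weights: np.ndarray, group_size: int = 32) -> np.ndarray:
        """Round-to-nearest INT4: one scale and zero-point per group of weights."""
        w = weights.reshape(-1, group_size)
        w_min = w.min(axis=1, keepdims=True)
        w_max = w.max(axis=1, keepdims=True)
        scale = (w_max - w_min) / 15.0            # 4 bits -> levels 0..15
        scale = np.where(scale == 0, 1.0, scale)  # guard against flat groups
        q = np.clip(np.round((w - w_min) / scale), 0, 15)
        return (q * scale + w_min).reshape(weights.shape)  # dequantized view

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 64)).astype(np.float32)
    print(f"mean abs quantization error: {np.abs(w - rtn_int4(w)).mean():.4f}")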

NitpickLawyer today at 2:43 PM

> at Q4_K_M, stock-style quantization is retaining ~99–99.8% of BF16 accuracy

I call BS on that. Not even FP8 hits 99.8% in every scenario; it's close, but not quite bit exact, and claiming 99%+ with Q4 is a stretch. Maybe if all you test is really old benchmark questions that are in every training set out there, but go a bit OOD and you'll see your Q4 crumble. Try coding in a niche language, or long-context math (not 1+2 from the MATH benchmark, and not in the AIME sets), and you'll see a few percentage points of accuracy loss for each quantization step.

bee_rider today at 2:35 PM

Because the accuracy loss is pretty small in both cases, that gap is still a pretty big relative improvement; the loss is roughly halved, right? I'm not sure how to interpret these percentage points from a usability point of view, though.
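Quick back-of-the-envelope arithmetic on the ranges quoted upthread (illustrative only, treating the quoted retention ranges as exact):

    # Accuracy loss vs BF16, in percentage points, from the ranges quoted above.
    rtn_loss = (100 - 99.8, 100 - 99.0)         # 0.2 to 1.0 pp for Q4_K_M RTN
    autoround_loss = (100 - 100.0, 100 - 99.4)  # 0.0 to 0.6 pp for AutoRound

    rtn_mid = sum(rtn_loss) / 2        # 0.6 pp
    ar_mid = sum(autoround_loss) / 2   # 0.3 pp
    print(f"loss roughly halved: {rtn_mid:.1f} pp -> {ar_mid:.1f} pp")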

muyuu today at 2:50 PM

Taking 20x to 40x the time RTN took, looking at the table at the bottom?

If so, that's a pretty drastic trade-off.

Der_Einzige today at 2:08 PM

Claims of "we preserve 99.9999% of accuracy" are made in practically every quantization paper. The whole subfield acts like it's totally fine that they're testing on datasets these models have been trained on.

In any other subfield this would be considered cheating and would get your paper rejected, but the quantization community really loves to spread FUD claiming that quantization doesn't harm models.

Also, similar dynamic with dense vs sparse MoE models. There's a reason we keep getting dense model releases alongside the MoEs out of China.

Quantization is not free, causes significant brain damage (especially on very long contexts), and has enough academic misconduct within it that it's actively screwing up the market. Don't believe me? Go ask your local financial analyst about the market's reaction to TurboQuant and then try to square that circle with this: https://openreview.net/forum?id=tO3ASKZlok (extreme and credible allegations of academic misconduct/fraud)
