Hacker News

code_biologist · today at 12:31 AM

What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".

[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...


Replies

zozbot234 · today at 6:42 AM

> "For tests of the type ran here, there appear to be major diminishing returns past Q4"

These statements are silly, because the only interesting comparison is between models with closely comparable on-disk sizes (or comparable active-parameter sizes). Obviously a Q4 quant is not going to be as effective as the Q6 quant of the same model; no one sensibly expects that. You need to compare the Q4 against a quant of a smaller model that takes up similar space. (The GP has the same problem, of course.) I believe that once you do that kind of comparison, more aggressive quantization of a larger model tends to win, down to about Q2 or so for casual chat, with perhaps slightly more bits per parameter needed for agentic use cases, where avoiding erratic behavior is important.
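To make "comparable on-disk sizes" concrete, here's a rough back-of-envelope sketch. The bits-per-weight averages are illustrative assumptions, not exact GGUF figures (real quant types mix block scales and multiple sub-quants, so actual averages vary by model):

```python
# Approximate average bits per weight for some common GGUF quant types.
# These numbers are rough illustrative values, not authoritative.
APPROX_BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def disk_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk size in GB for a parameter count and quant type."""
    bits = params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9

# A fair comparison pits, say, a 14B model at Q6_K against larger models
# whose more aggressive quants land at a similar on-disk size:
for params, quant in [(14, "Q6_K"), (20, "Q4_K_M"), (32, "Q2_K")]:
    print(f"{params}B {quant}: ~{disk_size_gb(params, quant):.1f} GB")
```

All three hypothetical models land in roughly the same 10–12 GB range, so "which quant is best?" only makes sense as "which point on this size-matched curve is best?".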