> "For tests of the type ran here, there appear to be major diminishing returns past Q4"
These statements are silly, because the only interesting comparison is among models with highly comparable on-disk sizes, or comparable active-parameter sizes. Obviously, a Q4 model is not going to be as effective as the same model at Q6: no one sensibly expects that. You need to compare the Q4 model against a smaller model at higher precision. (The GP has the same problem, of course.) I believe that once you do that kind of comparison, more aggressive quantization tends to win down to around Q2 for casual chat, with maybe slightly more bits per parameter warranted for agentic use cases where avoiding erratic behavior is important.
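To make the size-matched comparison concrete, here's a rough back-of-the-envelope sketch (the helper and the example model sizes are hypothetical; real GGUF files carry extra overhead from embeddings and metadata, and mixed-precision quant schemes don't use a uniform bits-per-weight):

```python
def approx_size_gb(params_billions: float, bits_per_param: float) -> float:
    # Rough on-disk size: parameters * bits, converted to gigabytes.
    # Ignores embedding tables, metadata, and mixed-precision layers.
    return params_billions * bits_per_param / 8

# A larger model at an aggressive quant vs. a smaller model at a
# gentler quant can land at a similar footprint -- that's the fair
# comparison, not Q4 vs. Q6 of the same model.
big_q4 = approx_size_gb(14, 4.5)   # ~7.9 GB
small_q8 = approx_size_gb(8, 8.5)  # ~8.5 GB
print(f"14B @ ~Q4: {big_q4:.1f} GB vs 8B @ ~Q8: {small_q8:.1f} GB")
```

The point being: at comparable footprints, the question "which quant level is best?" becomes "is a big model at few bits better than a small model at many bits?", which is the question the benchmarks above don't answer.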