ggerganov • today at 4:31 PM

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not the quantization. In this specific case, however, I use Q8 on the Mac and Q4 on the RTX machine because the latter cannot fit the full context at Q8. So far, I have no conclusive evidence that Q4 quantization significantly affects quality for this model and the tasks I use it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...

Replies

girvo • today at 10:27 PM

27B seems surprisingly resilient to quantisation. My evals did show some impact on coding ability going from 8-bit to 4-bit, but it was less than I would've expected, and it was on task types that you've said above you don't really do with these!
