If you are using quants below Q8, get them from Unsloth or Bartowski.
They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.
For Qwen 3.5, Unsloth ran 9 terabytes of quants to benchmark the effects of this.
That used to be a good suggestion, and it most likely still is if you're using a recent Nvidia dGPU, but absolutely not for iGPUs like Halo/Point or Arc LPG. The problem is bf16.
In short, even lower quants leave some layers at the original precision, and llama.cpp in its endless wisdom does no conversion at load time based on what your card supports. So every time you run inference it hits a brick wall when there's no bf16 acceleration and has to convert to fp16 (or something else) on the fly, which can literally cut token generation (tg) speed in half or worse. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size at the same bandwidth, and it's expectedly similar [0] on AMD.
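The frustrating part is that the widening itself is nearly free: bf16 is literally the high 16 bits of an fp32, so converting a tensor once is a single shift per value. A quick numpy sketch of the idea (my own illustration, not llama.cpp's code):

```python
import numpy as np

def bf16_to_fp32(bits: np.ndarray) -> np.ndarray:
    """Widen raw bf16 bit patterns (uint16) to fp32.

    bf16 shares fp32's sign and exponent layout, so widening is just
    placing the 16 bits in the high half of a 32-bit word.
    """
    return (bits.astype(np.uint32) << 16).view(np.float32)

# 0x3F80 is bf16 for 1.0 (the high half of fp32's 0x3F800000),
# 0xBF80 is -1.0, 0x0000 is 0.0
print(bf16_to_fp32(np.array([0x3F80, 0xBF80, 0x0000], dtype=np.uint16)))
```

Going the other way (fp32 to bf16) is the same trick in reverse, which is why doing it once at load instead of per-inference seems like such an obvious win.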
Models used to be released as fp16, which was fine. Then Gemma shipped native bf16, and Bartowski initially came up with a compatibility workaround: convert bf16 to fp32, then to fp16, and quantise from that. Most models are released as bf16 these days, though, and Bartowski has given up on doing that (while Unsloth never did it to begin with). So if you do want max speed, you pretty much have to make static quants yourself and follow the same multi-step process to strip all the stupid bf16 weights from the model. I don't get why this can't be done once at model load, ffs, but this is what we've got.
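For what it's worth, that multi-step process with llama.cpp's own tools looks roughly like this (model paths are placeholders; the flags are from convert_hf_to_gguf.py and llama-quantize):

```shell
# 1. Convert the bf16 HF checkpoint to an fp32 GGUF,
#    so no bf16 tensors survive into the quant
python convert_hf_to_gguf.py ./my-bf16-model --outtype f32 --outfile model-f32.gguf

# 2. Quantise from the fp32 GGUF to whatever you actually want to run
./llama-quantize model-f32.gguf model-q8_0.gguf Q8_0
# or, if fp16 is what your card accelerates best:
./llama-quantize model-f32.gguf model-f16.gguf F16
```

The fp32 intermediate is big (4 bytes per weight), but it's a one-time cost on disk instead of a per-token cost at inference.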
[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...