Much as I understand how a 5 bit quantization might be a sweet spot in the tradeoff between precision and making it possible to cram more weight parameters into limited ram, and thus in that respect better than e.g. 4 bit or 8 bit,…
…I struggle to comprehend how an odd quantization like 5 bit, that doesn't align well with 8 bit boundaries, would not slow things down for inference: given that on one hand the hardware doing the multiplications doesn't support vectors of 5 bit values but needs repacking to 8 bit before multiplication, and on the other hand the weights can't be bulk-repacked to 8 bit once and for all in advance (otherwise it wouldn't fit inside the RAM, besides in that case one would use a 8 bit quantization anyways)
it would require quite a lot of instructions per multiplication (way more than for 4 bit quantization where the alignment match simplifies things) to ad-hoc repack the 5 bit values to vectors of 8 bit. So i kinda wonder how much (percentage-wise) that would impact inference performance
> I struggle to comprehend how an odd quantization like 5 bit, that doesn't align well with 8 bit boundaries, would not slow things down for inference
Who says it doesn’t :)?
At least in my tests there is a big penalty to using an “odd” bit stride.
Testing 4bit quantization vs 5bit in Llama.cpp, I see quite a bit more than the “naiively expected” 25% slowdown from 4 to 5 bits.