> I struggle to comprehend how an odd quantization like 5 bit, that doesn't align well with 8 bit boundaries, would not slow things down for inference
Who says it doesn’t :)?
At least in my tests there is a big penalty to using an “odd” bit stride.
Testing 4bit quantization vs 5bit in Llama.cpp, I see quite a bit more than the “naiively expected” 25% slowdown from 4 to 5 bits.