While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
MI355X can perform FP6 operations with the same speed as their FP4 (unique to AMD) - people should be making MXFP6 quants which would be pretty much lossless, and much closer to FP4 performance than FP8
First thing I noticed as well
Kimi uses INT4 as its native format, there's no such thing as "better than 4-bit precision" for that model. This is in contrast with GLM for which 16-bit precision is native and 8-bit is in common use.