> if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around
Fwiw, not necessarily. I've noticed quantized models have strange and surprising failure modes: everything seems to be working well, and then the model goes into a death spiral repeating a specific word, or completely fails on one task out of a handful of similar tasks.
8-bit vs 4-bit can be almost imperceptible or night and day.
This isn't something you'd necessarily see while playing around, but it shows up when you're trying to do something specific.
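The 8-bit vs 4-bit gap has a simple numerical intuition. A toy sketch (not any particular library's quantization scheme, just plain symmetric round-to-nearest) shows how fast the worst-case weight error grows as you drop bits:

```python
import numpy as np

def quantize_roundtrip(w, bits):
    # Symmetric per-tensor quantization: map to a signed integer grid and back.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)  # stand-in for a weight tensor

for bits in (8, 4):
    err = np.abs(quantize_roundtrip(w, bits) - w).max()
    print(f"{bits}-bit max abs weight error: {err:.5f}")
```

Halving the bit width cuts the number of grid points by 16x, so the per-weight error jumps by roughly that factor; whether that compounds into a death spiral depends on the model and the task, which is why the effect can be imperceptible or night and day.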