What is the state of using quants? For chat models, a few errors or some lost intelligence may not matter much. But what happens to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent loop?
I'm interested in whether I can run it on a 24GB RTX 4090.
Also, would vLLM be a good option?
I like the byteshape quantizations - they use dynamic variable quantization, where the bit width per weight group is tuned for quality vs overall size. They seem to make fewer errors at a lower "average" bit width than the unsloth 4-bit quants. I think of it like variable-bitrate video compression: you spend more bits where they help overall model accuracy and fewer where they don't.
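To make the VBR analogy concrete, here's a rough sketch of how a mixed-precision layout averages out. The layer names, weight counts, and bit widths below are invented for illustration, not taken from any actual quant recipe:

```python
# Hypothetical mixed-precision layout: (weight count, bits per weight).
# Sensitive layers (embeddings, attention) kept at higher precision,
# the bulky FFN weights pushed down to fewer bits.
layers = {
    "embeddings": (0.5e9, 6.0),
    "attention":  (8.0e9, 4.5),
    "ffn":        (20.0e9, 3.0),
}

total_weights = sum(n for n, _ in layers.values())
total_bits = sum(n * b for n, b in layers.values())

avg_bits = total_bits / total_weights      # the "average" quant level
size_gb = total_bits / 8 / 1e9             # weight storage in GB

print(f"average bits/weight: {avg_bits:.2f}")
print(f"weight size: {size_gb:.1f} GB")
```

The point is just that a file advertised as "~3.5-bit" can still hold its most error-sensitive tensors at 4.5-6 bits, which is where the quality win over a uniform quant comes from.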
Should be able to run this in 22GB of VRAM, so your 4090 (and even a 3090) would be safe. This model also uses MLA, so you can run pretty large context windows without eating up a ton of extra VRAM.
edit: 19GB VRAM for a Q4_K_M - MLX4 is around 21GB, so you should be clear to run a lower quant on the 4090. Full BF16 is close to 60GB, so probably not viable.
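A quick back-of-envelope check on those numbers: taking the ~60GB BF16 figure at face value implies roughly a 30B-parameter model, and you can estimate weight memory at other quant levels from bits-per-weight. The bits/weight values below are typical llama.cpp averages, not exact for this model, and KV cache is extra on top:

```python
# Infer parameter count from the ~60 GB BF16 figure (16 bits/weight),
# then estimate weight memory at other quant levels. Bits/weight per
# quant type are approximate averages, not exact for any given model.
params = 60e9 * 8 / 16  # ~30B parameters

quants = {
    "BF16":   16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.85,  # lands near the ~19 GB figure above
    "Q3_K_M":  3.9,
}

for name, bits in quants.items():
    gb = params * bits / 8 / 1e9
    print(f"{name:7s} ~{gb:5.1f} GB weights (+ KV cache / activations)")
```

Q4_K_M comes out around 18GB of weights by this estimate, consistent with the ~19GB figure once you add runtime overhead, and MLA keeps the KV-cache side of the budget small.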