
ekojs yesterday at 3:41 PM

As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless. With that, you can run it on a 3090/4090/5090. You can probably even go FP8 on a 5090 (though there will be tradeoffs). Probably ~70 tok/s on a 5090 and roughly half that on a 4090/3090. With speculative decoding you can get even faster (2-3x, I'd say). Pretty amazing what you can get locally.
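For anyone who wants to try it, here's a minimal sketch of what a 4-bit local run can look like using Hugging Face transformers with bitsandbytes NF4 quantization. The model id is a placeholder (the thread doesn't pin one down), and a pre-quantized GGUF via llama.cpp is an equally common route on a single consumer GPU.

```python
# Minimal sketch: on-the-fly 4-bit (NF4) quantization with transformers + bitsandbytes.
# "some-org/some-32b-dense-model" is a placeholder, not the model from the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 is the usual "near-lossless" 4-bit choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_id = "some-org/some-32b-dense-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place everything on the single GPU if it fits
)

prompt = "Explain speculative decoding in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```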


Replies

Aurornis yesterday at 3:45 PM

> As this is a dense model and it's pretty sizable, 4-bit quantization can be nearly lossless

The 4-bit quants are far from lossless. The effects show up more on longer context problems.

> You can probably even go FP8 on a 5090 (though there will be tradeoffs)

You cannot run these models at 8-bit on a 32GB card because you need space for context. Typically you'd run Q5 on a 32GB card to fit the context lengths needed for anything other than short answers.
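Rough, back-of-envelope numbers make the point. The parameter count and attention geometry below are assumptions for illustration (roughly 32B dense, GQA with 8 KV heads), not figures from the thread.

```python
# Back-of-envelope VRAM estimate: quantized weights + fp16 KV cache.
# All model dimensions here are assumptions, not specs from the thread.
PARAMS = 32e9    # assumed ~32B dense parameters
LAYERS = 64      # assumed layer count
KV_HEADS = 8     # assumed grouped-query attention KV heads
HEAD_DIM = 128   # assumed head dimension
KV_BYTES = 2     # fp16/bf16 KV cache

def weights_gb(bits_per_weight: float) -> float:
    """Weight memory in GB at a given average bits-per-weight."""
    return PARAMS * bits_per_weight / 8 / 1e9

def kv_cache_gb(context_tokens: int) -> float:
    """KV cache memory in GB: K and V tensors for every layer and token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * context_tokens / 1e9

for label, bpw in [("Q4_K_M (~4.8 bpw)", 4.8),
                   ("Q5_K_M (~5.7 bpw)", 5.7),
                   ("Q8_0 (~8.5 bpw)", 8.5)]:
    w = weights_gb(bpw)
    for ctx in (8_192, 32_768):
        total = w + kv_cache_gb(ctx)
        print(f"{label:>18}  ctx={ctx:>6}  weights={w:5.1f} GB  total~{total:5.1f} GB")
```

Under these assumptions, 8-bit weights alone already exceed 32 GB, while Q5 leaves a handful of GB for a longer context, which is the tradeoff described above.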

zozbot234 yesterday at 3:48 PM

4-bit quantization is almost never lossless, especially for agentic work; it's the lowest end of what's reasonable. It's advocated as preferable to a model with fewer parameters quantized at higher precision.

binary132 yesterday at 3:42 PM

That seems awfully speculative without at least some anecdata to back it up.
