Hacker News

poorman · last Sunday at 8:59 PM

This article really goes into a lot of detail, which is nice. But gpt-oss is just not good for agentic use, in my experience.

tl;dr: I'll save you a lot of time trying things out for yourself. If you're on a Mac with >=32 GB of RAM, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! Its tool calling is excellent; gpt-oss doesn't even come close in my experience.

[1] https://opencode.ai/
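
For reference, LM Studio exposes an OpenAI-compatible server locally (http://localhost:1234/v1 by default), which is what tools like opencode point at. Here's a minimal tool-calling smoke test against that endpoint, as a sketch; the `read_file` tool is hypothetical and the model identifier may differ on your install:

```python
# Minimal tool-calling smoke test against LM Studio's local
# OpenAI-compatible server (assumes the model is already loaded).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b-instruct-mlx@5bit",  # name may differ locally
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)

# A model with solid tool calling should answer with a structured
# tool call rather than prose.
print(resp.choices[0].message.tool_calls)
```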


Replies

LarMachinarum · last Monday at 10:00 AM

Much as I understand how 5-bit quantization might be a sweet spot in the tradeoff between precision and cramming more weight parameters into limited RAM, and in that respect better than, e.g., 4-bit or 8-bit,…

…I struggle to see how an odd width like 5 bits, which doesn't align with 8-bit boundaries, wouldn't slow down inference: the hardware doing the multiplications doesn't support vectors of 5-bit values, so they have to be repacked to 8 bits before multiplying, yet the weights can't be bulk-repacked to 8 bits once and for all in advance (then they wouldn't fit in RAM, and at that point you'd just use an 8-bit quantization anyway).

It would take quite a few instructions per multiplication (far more than for 4-bit, where the alignment match simplifies things) to repack the 5-bit values into 8-bit vectors on the fly. So I kinda wonder how much, percentage-wise, that impacts inference performance.
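
To make the repacking cost concrete, here's a rough sketch in plain Python (standing in for what would really be SIMD code); the function names are mine. 4-bit values fall out of each byte with one mask and one shift, while 5-bit values straddle byte boundaries and need a running bit accumulator:

```python
def unpack_4bit(packed: bytes) -> list[int]:
    # Two 4-bit values per byte: one mask and one shift each.
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out

def unpack_5bit(packed: bytes) -> list[int]:
    # Eight 5-bit values per 5-byte group: values straddle byte
    # boundaries, so each extraction needs accumulator bookkeeping
    # (shift in, mask out, count bits) on top of the mask/shift.
    out = []
    acc, nbits = 0, 0
    for b in packed:
        acc |= b << nbits
        nbits += 8
        while nbits >= 5:
            out.append(acc & 0x1F)
            acc >>= 5
            nbits -= 5
    return out
```

That said, practical formats often dodge the serial decode entirely: ggml's Q5_0, for example, stores 4 bits of each weight as nibbles plus the fifth bits in a separate bit-plane, so extraction stays mask/shift/OR work rather than bitstream parsing. Whether MLX's 5-bit packing does something similar, I'm not sure.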

ModelForge · last Sunday at 9:50 PM

The Ollama one uses even less (around 13 GB), which is nice. Apparently the gpt-oss team also shared the MXFP4 optimizations for Metal.
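
For anyone curious, MXFP4 is the OCP microscaling format: 4-bit E2M1 elements sharing one power-of-two (E8M0) scale per block of 32. A sketch of dequantizing a single block, with the byte layout being my assumption:

```python
# FP4 E2M1 code -> value lookup (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant_mxfp4_block(scale_e8m0: int, packed: bytes) -> list[float]:
    """Dequantize one 32-element MXFP4 block (16 packed bytes + 1 scale byte).

    Ignores E8M0 special values (e.g. NaN) for simplicity.
    """
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0: pure power-of-two scale
    out = []
    for b in packed:  # two 4-bit codes per byte
        out.append(FP4_E2M1[b & 0x0F] * scale)
        out.append(FP4_E2M1[b >> 4] * scale)
    return out
```

The appeal of the shared block scale is that the inner loop stays cheap: one table lookup and one multiply per weight.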