Qwen3 32B is a dense model: it uses all of its parameters all the time. GPT OSS 20B is a sparse MoE model, meaning it only uses a fraction of its parameters (about 3.6B) per token. That tradeoff makes it faster to run than a dense 20B model and much smarter than a 3.6B one.
In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.
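To make "only uses a fraction at a time" concrete, here is a minimal toy sketch of top-k expert routing in a MoE layer. The sizes, expert count, and top-k value are made up for illustration and are not the actual GPT OSS configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # toy dimensions, not the real model's
n_experts, top_k = 8, 2      # illustrative: 8 experts, route each token to 2

# Each expert is its own feed-forward block with independent weights.
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Sparse MoE forward pass for a single token vector x."""
    # The router scores every expert, but only the top-k experts actually run,
    # so only their weights are touched for this token.
    logits = x @ router_w
    chosen = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()

    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # simple ReLU FFN expert
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)

total_params = n_experts * 2 * d_model * d_ff
active_params = top_k * 2 * d_model * d_ff
print(f"total expert params: {total_params:,}")
print(f"active per token:    {active_params:,} ({active_params / total_params:.0%})")
```

All experts sit in memory, but per token only the chosen ones do any matrix multiplies, which is why the compute cost tracks the active parameter count rather than the total.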
> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.
They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model, it's more than fair to compare it to Qwen3 32B.
Tangential question from an outsider:
When people talk about sparse or dense models, are they sparse or dense matrices in the conventional numerical linear algebra sense? (Something like a CSR matrix?)