You call that fair? 32 / 5.1 > 6, so it takes over 6 times more compute per token. Put another way, Qwen3 32B is 6 times slower than GPT OSS 120B.
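A quick back-of-envelope, assuming per-token decode compute scales with the active parameter count (dense Qwen3 32B activates all 32B; gpt-oss 120B activates ~5.1B):

```python
# Back-of-envelope: per-token decode compute scales with *active* parameters.
qwen3_active = 32e9      # dense: all 32B parameters are active for every token
gpt_oss_active = 5.1e9   # MoE: only ~5.1B of the 120B are active per token

ratio = qwen3_active / gpt_oss_active
print(f"Qwen3 32B does ~{ratio:.1f}x the compute per token")  # -> ~6.3x
```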
We are talking about accuracy, though. I don't see the point of MoE if a 120B MoE model is not as accurate as even a 32B model.
I've read many times that an MoE model should be roughly comparable to a dense model whose parameter count is the geometric mean of the MoE's total and active parameter counts.
In the case of gpt-oss 120B that would mean sqrt(5.1 × 120) ≈ 25B.
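A minimal sketch of that rule of thumb. To be clear, the geometric-mean heuristic is folklore rather than an established law, and the 120B/5.1B figures are the commonly cited total/active counts for gpt-oss 120B:

```python
import math

def moe_dense_equivalent(total_b: float, active_b: float) -> float:
    """Folklore heuristic: an MoE performs like a dense model whose size is
    the geometric mean of the MoE's total and active parameter counts."""
    return math.sqrt(total_b * active_b)

# gpt-oss 120B: ~120B total parameters, ~5.1B active per token
print(f"~{moe_dense_equivalent(120, 5.1):.1f}B dense-equivalent")  # -> ~24.7B
```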
>Qwen3 32B is 6 times slower than GPT OSS 120B.
Only if the 120B model fits entirely in VRAM. Otherwise, for me, on a consumer GPU with only 32 GB of VRAM, gpt-oss 120B is actually almost 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec).
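For context, that measured gap is ~1.8x. A rough sketch of why offloading flips the ranking (every bandwidth and offload figure below is an assumption for illustration, not a measurement): during decode each token reads every active weight once, so whatever slice of the active experts lives in system RAM is paid for at PCIe/RAM speed rather than VRAM speed.

```python
# Observed numbers from the comment above:
gpt_oss_tps, qwen3_tps = 37.0, 65.0
print(f"measured gap: {qwen3_tps / gpt_oss_tps:.2f}x")  # -> ~1.76x

# Illustrative bandwidth-bound model with assumed figures:
active_bytes = 5.1e9   # ~5.1B active params at ~1 byte/param (4-bit experts + higher-precision attention)
vram_bw = 1000e9       # assumed ~1 TB/s GPU memory bandwidth
pcie_bw = 30e9         # assumed ~30 GB/s effective PCIe/system-RAM path
offloaded = 0.15       # assumed fraction of active weights not resident in VRAM

seconds_per_token = ((1 - offloaded) * active_bytes / vram_bw
                     + offloaded * active_bytes / pcie_bw)
print(f"bandwidth-bound estimate: ~{1 / seconds_per_token:.0f} tok/s")  # -> ~34 tok/s
```

The point is that the PCIe term dominates even when only a small fraction of the active weights is offloaded, which is how a model with 6x less compute per token can still end up slower in practice.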