
littlestymaar, last Monday at 7:13 AM

I've read many times that MoE models should be comparable to dense models with a parameter count equal to the geometric mean of the MoE's total and active parameter counts.

In the case of gpt-oss 120B, that would mean sqrt(5×120) ≈ 24B.
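
As a quick sanity check of that rule of thumb, here's a small sketch (the function name is made up; 5.1B is the commonly cited active-parameter count for gpt-oss 120B):

    import math

    # Rule-of-thumb sketch: a MoE is roughly comparable to a dense model whose
    # parameter count is the geometric mean of the MoE's total and active
    # parameter counts (both in billions here).
    def dense_equivalent_b(total_b: float, active_b: float) -> float:
        return math.sqrt(total_b * active_b)

    print(f"{dense_equivalent_b(120, 5.1):.1f}B")  # ~24.7B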


Replies

Mars008, last Monday at 3:52 PM

Not sure there is one formula, because there are two different cases:

1) Performance constrained, like an NVIDIA Spark with 128GB or an AGX with 64GB.

2) Memory constrained, like consumer GPUs.

In the first case MoE is a clear win: the model fits and runs faster. In the second case dense models will produce better results, and if the performance in tokens/sec is acceptable, they are the better choice.

selcuka, last Monday at 12:30 PM

> In the case of gpt-oss 120B, that would mean sqrt(5×120) ≈ 24B.

That's actually in line with what I had (unscientifically) expected. Claude Sonnet 4 seems to agree:

> The most accurate approach for your specific 120B MoE (5.1B active) would be to test it empirically against dense models in the 10-30B range.

kgeist, last Monday at 11:20 AM

I've read that the formula is based on the early Mistral models and does not necessarily reflect what's going on nowadays.