meatmanek • today at 5:10 PM

It's not that surprising that an 8B dense model would compete with a 35B-A3B MoE model.

The geometric mean rule of thumb for MoE models is that the intelligence level of an MoE model with T total parameters and A active parameters is roughly equivalent to that of a dense model with sqrt(A*T) parameters. For qwen3.6-35B-A3B, that equivalent size is sqrt(3*35) ≈ 10.25B, within spitting distance of an 8B model. Good training can make up the 28% difference in size.
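
If you want to sanity-check the arithmetic, here's a rough Python sketch. The function name is my own (not from any library), and the rule itself is only a heuristic, so treat the output as a ballpark rather than a measurement.

```python
import math


def dense_equivalent_params(total_b: float, active_b: float) -> float:
    """Geometric-mean rule of thumb: sqrt(A * T), in billions of parameters.

    Hypothetical helper for illustration; the rule is a heuristic, not a law.
    """
    return math.sqrt(active_b * total_b)


# 35B total, 3B active, compared against an 8B dense model
equiv = dense_equivalent_params(total_b=35.0, active_b=3.0)
gap = equiv / 8.0 - 1.0  # how much larger the dense-equivalent size is than 8B

print(f"dense-equivalent size: {equiv:.2f}B")
print(f"gap vs 8B dense: {gap:.0%}")
```

The same function works for any other MoE: plug in its total and active parameter counts and compare against whatever dense model you care about.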
