A common rule of thumb for MoE models: dense-equivalent parameter count ≈ sqrt(active parameter count × total parameter count)

sqrt(5B × 120B) ≈ 24.5B

By this heuristic, GPT-OSS 120B (~120B total parameters, ~5B active per token) is effectively a ~24B dense model, while running at the inference speed of a ~5B model.
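A minimal sketch of the arithmetic, assuming parameter counts expressed in billions; the function name is just for illustration:

```python
import math

def moe_dense_equivalent(active_params_b: float, total_params_b: float) -> float:
    """Geometric-mean rule of thumb: a sparse MoE with the given active and
    total parameter counts is expected to perform roughly like a dense model
    of this size (all values in billions of parameters)."""
    return math.sqrt(active_params_b * total_params_b)

# GPT-OSS 120B: ~120B total parameters, ~5B active per token
print(moe_dense_equivalent(5, 120))  # ~24.5
```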