A common rule of thumb for MoE models: dense-equivalent parameter count ≈ sqrt(active parameter count × total parameter count)

sqrt(5B × 120B) ≈ 24.5B

By this heuristic, GPT-OSS 120B (~120B total parameters, ~5B active per token) is effectively a ~24B dense model, while running at the inference speed of a ~5B model.
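A minimal sketch of the arithmetic, assuming parameter counts expressed in billions; the function name is just for illustration:

```python
import math

def moe_dense_equivalent(active_params_b: float, total_params_b: float) -> float:
    """Geometric-mean rule of thumb: a sparse MoE with the given active and
    total parameter counts is expected to perform roughly like a dense model
    of this size (all values in billions of parameters)."""
    return math.sqrt(active_params_b * total_params_b)

# GPT-OSS 120B: ~120B total parameters, ~5B active per token
print(moe_dense_equivalent(5, 120))  # ~24.5
```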