Hacker News

BoorishBears · last Sunday at 8:02 PM

MoE expected performance ≈ sqrt(active parameter count × total parameter count)

sqrt(120 × 5) ≈ 24 (parameter counts in billions)

GPT-OSS 120B is effectively a ~24B-parameter model, but with the speed of a much smaller one, since only the ~5B active parameters are computed per token.
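
A minimal sketch of this rule of thumb in Python (the geometric-mean heuristic; the ~117B total / ~5.1B active figures for GPT-OSS 120B are approximate):

    import math

    def moe_effective_params(total_b: float, active_b: float) -> float:
        # Rule-of-thumb dense-equivalent size of an MoE model:
        # the geometric mean of total and active parameter counts.
        return math.sqrt(total_b * active_b)

    # GPT-OSS 120B: ~117B total, ~5.1B active per token (approximate).
    print(f"~{moe_effective_params(117, 5.1):.1f}B dense-equivalent")  # ~24.4B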