It is indeed not something clarified by the code snippets. In a regular feedforward layer, it is common to choose "hidden_dim = 4 * emb_dim", while in a GLU feedforward layer the convention is "hidden_dim = 2/3 * regular_ffn_hidden_dim", which keeps the overall number of parameters roughly the same (a GLU layer has three weight matrices instead of two, so the 2/3 factor compensates). In the case of gpt-oss, they went a bit more extreme and set "hidden_dim = emb_dim", reducing the overall number of parameters even further!
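To make the parameter arithmetic concrete, here is a small sketch (biases ignored, and the emb_dim value is just illustrative):

```python
emb_dim = 2880  # illustrative embedding dimension

# Regular feedforward: two weight matrices (up- and down-projection).
ffn_hidden = 4 * emb_dim
ffn_params = emb_dim * ffn_hidden + ffn_hidden * emb_dim  # 8 * emb_dim**2

# GLU feedforward: three weight matrices (gate, up, down).
# Shrinking hidden_dim to 2/3 of the regular FFN's keeps the total equal:
# 3 * emb_dim * (2/3 * 4 * emb_dim) = 8 * emb_dim**2.
glu_hidden = int(2 / 3 * ffn_hidden)
glu_params = 3 * emb_dim * glu_hidden

# gpt-oss goes further and sets hidden_dim = emb_dim,
# cutting the feedforward parameters to 3 * emb_dim**2.
gptoss_params = 3 * emb_dim * emb_dim

print(ffn_params, glu_params, gptoss_params)
```

With the conventional 2/3 factor the GLU layer matches the regular FFN exactly, while the gpt-oss choice ends up well below both.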
Ah, thank you!