Hacker News

mike_hearn · last Monday at 12:48 PM

I'm really not a PyTorch expert so this is most likely a newbie error, but could someone explain to me the code in Figure 7?

The code circled as "4 x emb_dim" doesn't seem to apply a 4x multiplier anywhere. In fact, the layer definitions of fc1 and fc2 in the SwiGLU variant appear to be identical to the code in the regular feed-forward block. What is making the two layers in the second code snippet a different size from fc1 in the first?


Replies

fdalvi · last Monday at 1:00 PM

It is indeed not something the code snippets clarify. In normal feedforward layers, it is common to choose "hidden_dim = 4 * emb_dim", while in GLU feedforward layers the convention is "hidden_dim = 2/3 * regular_ffn_hidden_dim" (to keep the overall number of parameters roughly the same). In the case of gpt-oss, they went a bit more extreme and set "hidden_dim = emb_dim", thus reducing the overall number of parameters!
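
To make that concrete, here is a minimal sketch (not the actual Figure 7 code; the class names and the config values are illustrative assumptions). The point is that the layer definitions can look identical in both snippets because the hidden size is just a value passed in; only the chosen hidden_dim differs between the regular FFN and the GLU variant.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeedForward(nn.Module):
        # Regular FFN: conventionally hidden_dim = 4 * emb_dim.
        def __init__(self, emb_dim, hidden_dim):
            super().__init__()
            self.fc1 = nn.Linear(emb_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, emb_dim)

        def forward(self, x):
            return self.fc2(F.gelu(self.fc1(x)))

    class SwiGLUFeedForward(nn.Module):
        # GLU variant: two parallel up-projections; fc1's output gates fc2's.
        def __init__(self, emb_dim, hidden_dim):
            super().__init__()
            self.fc1 = nn.Linear(emb_dim, hidden_dim)   # gate projection
            self.fc2 = nn.Linear(emb_dim, hidden_dim)   # value projection
            self.fc3 = nn.Linear(hidden_dim, emb_dim)   # down projection

        def forward(self, x):
            return self.fc3(F.silu(self.fc1(x)) * self.fc2(x))

    emb_dim = 512
    regular = FeedForward(emb_dim, hidden_dim=4 * emb_dim)
    # ~2/3 of the regular hidden size keeps parameter counts roughly equal,
    # since the GLU block has three weight matrices instead of two.
    swiglu = SwiGLUFeedForward(emb_dim, hidden_dim=(2 * 4 * emb_dim) // 3)
    # gpt-oss (per the comment above) goes further: hidden_dim = emb_dim.
    gpt_oss_style = SwiGLUFeedForward(emb_dim, hidden_dim=emb_dim)

    def n_params(m):
        return sum(p.numel() for p in m.parameters())

    print(n_params(regular), n_params(swiglu), n_params(gpt_oss_style))

Running this prints roughly 2.10M, 2.10M, and 0.79M parameters, which is why the gpt-oss choice shrinks the block even though the nn.Linear definitions look the same.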
