I’m referring specifically to the fundamental residual connection backbone that defines the transformer architecture (x_{l+1} = x_l + F(x_l)).
While the sub-modules differ (MHA vs GQA, SwiGLU vs GeLU, Mixture-of-Depths, etc.), the core signal propagation in Llama, Gemini, and Claude relies on that additive residual stream.
My point here is that DeepSeek's mHC challenges that additive assumption by introducing learnable weighted scaling factors on the residual path itself.
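
To make the contrast concrete, here's a minimal PyTorch sketch. The names (`StandardResidualBlock`, `ScaledResidualBlock`, `alpha`, `beta`) are mine, and the second block is only an illustration of "learnable weighted scaling factors on the residual path" in general, not DeepSeek's actual mHC construction:

```python
import torch
import torch.nn as nn

class StandardResidualBlock(nn.Module):
    """Plain additive residual stream: x_{l+1} = x_l + F(x_l)."""
    def __init__(self, d_model: int):
        super().__init__()
        # F(x) stands in for the usual attention / MLP sub-module.
        self.f = nn.Sequential(nn.LayerNorm(d_model),
                               nn.Linear(d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # identity path is fixed at 1.0

class ScaledResidualBlock(nn.Module):
    """Hypothetical variant: the identity path gets a learnable weight,
    x_{l+1} = alpha * x_l + beta * F(x_l). A sketch of per-path scaling
    only, not the mHC formulation itself."""
    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(nn.LayerNorm(d_model),
                               nn.Linear(d_model, d_model))
        # Initialized to 1 so the block starts out as a standard residual.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * x + self.beta * self.f(x)
```

With `alpha` and `beta` initialized to 1, the scaled block is identical to the plain additive stream at step 0, so any departure from pure addition is something the model learns during training rather than a hard-coded architectural change.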