Hacker News

WiSaGaN · yesterday at 2:54 PM

I guess I am asking how we know Gemini and Claude rely on the additive residual stream. We don't know the architecture details of these closed models, do we?


Replies

taykolasinski · yesterday at 2:58 PM

That's a fair point. We don't have the weights or code for the closed models, so we can't be 100% certain.

However, being transformer-based (which their technical reports confirm) implies the standard pre-norm/post-norm residual block structure. Without those additive residual connections, training networks of that depth (100+ layers) becomes difficult due to the vanishing gradient problem.
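For reference, here's a minimal sketch of what that standard pre-norm block looks like (illustrative PyTorch, not anyone's actual production code; names like PreNormBlock are mine):

    import torch
    import torch.nn as nn

    class PreNormBlock(nn.Module):
        """Pre-norm transformer block: each sublayer *adds* its output
        onto the residual stream rather than replacing it."""
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x is the residual stream. The identity path (x + ...) is what
            # lets gradients flow directly through 100+ stacked layers.
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)
            x = x + attn_out
            x = x + self.mlp(self.norm2(x))
            return x

Whatever the closed models do internally, any transformer deep enough to train stably almost certainly has that "x = x + sublayer(x)" shape somewhere in each block.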

If they had solved deep signal propagation without residual streams, that would likely be a bigger architectural breakthrough than the model itself (akin to Mamba/SSMs). It’s a very high-confidence assumption, but you are right that it is still an assumption.