This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.
It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.
I'm curious whether the low-rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.
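To make the comparison concrete, here is a rough sketch of how I picture the two skip paths. Treat it as an illustration under my own assumptions, not the exact formulation from the LAuReL paper or what Gemma 3n ships: the generic form x + up(down(x)) + alpha * F(x), the tanh stand-in for F, and all function and parameter names (`standard_residual`, `low_rank_residual`, `down`, `up`, `alpha`) are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def f(x):
    """Stand-in for the block's nonlinear function F (hypothetical choice)."""
    return [math.tanh(v) for v in x]


def standard_residual(x):
    """Classic fixed skip connection: x + F(x)."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def low_rank_residual(x, down, up, alpha=1.0):
    """Skip path with a learned low-rank correction (assumed form):
    x + up @ (down @ x) + alpha * F(x).

    down has shape r x d and up has shape d x r, with r much smaller than d,
    so the learned part of the skip path has rank at most r.
    """
    correction = matvec(up, matvec(down, x))
    return [xi + ci + alpha * fi for xi, ci, fi in zip(x, correction, f(x))]


# d = 4, r = 1: the correction can only move the signal along one direction.
x = [2.0, 0.0, 0.0, 0.0]
down = [[1.0, 0.0, 0.0, 0.0]]        # r x d
up = [[1.0], [2.0], [0.0], [0.0]]    # d x r
zero_up = [[0.0], [0.0], [0.0], [0.0]]

plain = standard_residual(x)
augmented = low_rank_residual(x, down, up)
# With the up-projection zeroed out, the low-rank path falls back to x + F(x).
fallback = low_rank_residual(x, down, zero_up)
```

The point of the sketch is that the learned part of the skip path is squeezed through r dimensions, so it has far fewer degrees of freedom than a full d x d mixing matrix. Whether that is what actually keeps gradients in check is the open question, and this sketch does not test it.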
Thanks for the paper link, definitely reading that tonight.
Thanks! Would be quite interesting to see how this fares compared to mHC.
I noticed that LAuReL is cited in the mHC paper, but they refer to it as "expanding the width of the residual stream," which is rather odd.