That's interesting.
I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.
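For concreteness, here's a toy sketch of the multi-stream idea: n parallel residual streams with a learnable mixing matrix, in the spirit of hyper-connections. This is illustrative only, not the actual mHC formulation; the stream count, the mean-pooled layer input, and the alpha/beta parameterization are all my assumptions:

```python
import torch
import torch.nn as nn

class ToyHyperConnection(nn.Module):
    # Toy multi-stream residual mixing in the spirit of hyper-connections.
    # Illustrative sketch only; NOT the actual mHC formulation.
    def __init__(self, n_streams: int = 4):
        super().__init__()
        # alpha: learnable n x n mixing across the streams (the "router")
        self.alpha = nn.Parameter(torch.eye(n_streams))
        # beta: how strongly the layer output is written into each stream
        self.beta = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))

    def forward(self, streams: torch.Tensor, layer: nn.Module) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d_model)
        h = layer(streams.mean(dim=0))  # simplest possible width connection
        mixed = torch.einsum('ij,jbsd->ibsd', self.alpha, streams)
        return mixed + self.beta.view(-1, 1, 1, 1) * h.unsqueeze(0)
```

Vanilla residuals are the degenerate case (n_streams=1, alpha=1, beta=1), which is why at small scale the extra alpha/beta machinery can be pure parameter overhead.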
Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? Or was it stable for you, just neutral on loss?
My baseline was non-HC "vanilla" residuals; I didn't do a meaningful HC run to compare.
My application has some particularities (important, easy-to-identify per-token signals) that cause values to grow by roughly 3x to 10x through the layers even in the baseline.
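If you want to check the same thing on your side, here's a minimal sketch of how I measure it: log the norm of each block's output during one forward pass and look at the growth factor. It assumes the blocks are reachable as `model.blocks` and return plain tensors; adjust both for your architecture:

```python
import torch

def residual_norms(model, x):
    # Mean L2 norm of each block's output over one forward pass.
    # Assumes model.blocks is the list of transformer blocks and each
    # block returns a plain tensor; adapt for your architecture.
    norms, hooks = [], []
    for block in model.blocks:
        hooks.append(block.register_forward_hook(
            lambda mod, inp, out: norms.append(
                out.norm(dim=-1).mean().item())))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return norms  # norms[-1] / norms[0] is the end-to-end growth factor
```

In my runs that end-to-end ratio is where the 3x to 10x figure above comes from.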
> Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? Or was it stable for you, just neutral on loss?
I think your brain may have been taken over by ChatGPT.