Hacker News

taykolasinski yesterday at 4:28 PM

That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?


Replies

astrange yesterday at 11:55 PM

> Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

I think your brain may have been taken over by ChatGPT.

Scene_Cast2 yesterday at 6:47 PM

My baseline was non-HC "vanilla" residuals; I didn't do a meaningful HC run to compare.

My application has some particularities (important, easy-to-identify per-token signals) that result in values growing (about 3x to 10x) through the layers even in the baseline.
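
For anyone who wants to check this kind of growth on their own setup, here is a minimal sketch of measuring residual-stream magnitude layer by layer. It is not the model from either run: the toy layer (a random linear map followed by tanh), the init scale, and the names `residual_growth`, `random_matrix`, and `rms` are all assumptions made for illustration. It only shows the bookkeeping: record the RMS of the stream after each plain residual add and compare it to the input.

```python
import math
import random


def rms(v):
    """Root-mean-square of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def random_matrix(dim, scale, rng):
    """dim x dim matrix with N(0, (scale^2)/dim) entries (hypothetical init)."""
    std = scale / math.sqrt(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]


def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def residual_growth(depth=8, dim=32, scale=0.5, seed=0):
    """Return the stream RMS after each layer, as a multiple of the input RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    base = rms(x)
    ratios = []
    for _ in range(depth):
        w = random_matrix(dim, scale, rng)
        # Toy stand-in for a layer's branch output (assumption, not a real block).
        branch = [math.tanh(h) for h in matvec(w, x)]
        # Standard "vanilla" residual connection: x <- x + f(x).
        x = [xi + bi for xi, bi in zip(x, branch)]
        ratios.append(rms(x) / base)
    return ratios


if __name__ == "__main__":
    for layer, ratio in enumerate(residual_growth(), start=1):
        print(f"layer {layer}: {ratio:.2f}x input RMS")
```

In a real model the same measurement would normally be taken with forward hooks on each block's output rather than a hand-rolled loop, and the ratios depend heavily on normalization placement and init, so the numbers from this toy say nothing about the 3x to 10x figure above.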