logoalt Hacker News

Scene_Cast2yesterday at 4:22 PM1 replyview on HN

I implemented this for a toy 8M ViT-style model. Got neutral results. This is just an anecdote and is not representative - I think mHC will help with larger parameter sizes and larger token counts.


Replies

taykolasinskiyesterday at 4:28 PM

That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

show 2 replies