Scene_Cast2 • yesterday at 4:22 PM • 1 reply

I implemented this for a toy 8M ViT-style model. Got neutral results. This is just an anecdote and is not representative; I think mHC will help at larger parameter counts and larger token counts.

Replies

taykolasinski • yesterday at 4:28 PM

That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? Or was it stable for you, just neutral on loss?
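For anyone wanting to check the same thing, here is a minimal sketch of what "values growing during the forward pass" could look like as a measurement. It is not the code from either run and it does not implement mHC. The toy block (a random linear map followed by tanh), the `branch_scale` parameter, and the helper names `rms` and `forward_with_rms_trace` are all assumptions made for illustration. It records the RMS of the residual stream after each layer so growth can be read off directly.

```python
import math
import random


def rms(vec):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in vec) / len(vec))


def forward_with_rms_trace(x, depth, branch_scale, rng):
    """Run a toy residual stack and record the stream RMS after every layer.

    Each "block" is a hypothetical stand-in: a fresh random linear map
    followed by tanh, added back onto the stream as x + branch_scale * f(x).
    """
    dim = len(x)
    trace = [rms(x)]
    for _ in range(depth):
        # Random weights scaled by 1/sqrt(dim) so pre-activations stay O(1).
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
        branch = [math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
                  for row in w]
        # Standard residual update.
        x = [x_i + branch_scale * b_i for x_i, b_i in zip(x, branch)]
        trace.append(rms(x))
    return trace


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(32)]
trace = forward_with_rms_trace(x0, depth=16, branch_scale=1.0, rng=rng)

for layer, value in enumerate(trace):
    print(f"layer {layer:2d}  rms={value:.3f}")
print(f"growth ratio (last / first): {trace[-1] / trace[0]:.2f}")
```

In a real model the same per-layer statistic would normally be collected with forward hooks on each block rather than by hand. A growth ratio that stays near 1 across depth would count as "stable" in the sense asked about above.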

alt Hacker News