OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).
Two key takeaways from the reproduction:
1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).
2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.
This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
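Roughly, the core per-block update I implemented looks like the sketch below. Treat it as illustrative rather than the paper's exact parameterization; the names (H_pre, H_post, H_res) follow my write-up and the shapes are simplified.

    import torch
    import torch.nn as nn

    class HyperConnection(nn.Module):
        """One block's mixing over n parallel residual streams (illustrative sketch)."""

        def __init__(self, n_streams: int):
            super().__init__()
            # H_pre: collapses the n streams into a single layer input
            self.h_pre = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
            # H_post: scatters the layer output back onto the n streams
            self.h_post = nn.Parameter(torch.ones(n_streams))
            # H_res: n x n mixing of the streams on the skip path.
            # Left unconstrained, repeated multiplication by this matrix across
            # layers is what amplifies the stream norm (the ~7x above).
            self.h_res = nn.Parameter(torch.eye(n_streams))

        def forward(self, x, layer):
            # x: (batch, seq, n_streams, d_model); layer: attention or MLP block
            layer_in = torch.einsum('bsnd,n->bsd', x, self.h_pre)
            layer_out = layer(layer_in)                        # (batch, seq, d_model)
            skip = torch.einsum('bsnd,nm->bsmd', x, self.h_res)
            return skip + self.h_post.view(1, 1, -1, 1) * layer_out.unsqueeze(2)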
What's the advantage of having multiple channels with separate residual connections? Why not just concatenate those channels and do a residual connection on the concatenated channel?
Why can't you just leave H_res as the identity matrix (or just not use it at all)? In that case, the model is basically a ResNet again and you don't need to worry about exploding/vanishing gradients from H_res.
I would think that H_post and H_pre could cover the lost expressiveness.
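If I'm reading the hyper-connection update right (using the notation from the sketch above), fixing H_res to the identity collapses the skip path back to plain per-stream residuals:

    # With h_res = I, the skip term is just x itself, so each stream i becomes
    #   x_i <- x_i + h_post[i] * layer(sum_j h_pre[j] * x_j)
    # i.e. n ordinary ResNet-style skips around a shared layer output, with only
    # H_pre / H_post deciding how the streams exchange information.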
Great write-up. It's been a while since I had the pleasure of reading a straightforward blog post about ML tricks that feel genuinely applicable to many use cases.
Between the clear writing and the diagrams, this was a great write-up. I had actually skipped reading up on mHC as it sounded like it was going to take some time to grok, but this made it immediately approachable. I hope you do more write-ups like this in the future.
I implemented this for a toy 8M ViT-style model and got neutral results. This is just an anecdote and not representative - I think mHC will help more at larger parameter counts and larger token counts.
I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?
Nice visualization of the residual connections. Is the animated SVG manually created or programmatically generated? What tools did you use?
It may be worth pointing out that this is not the first residual-connection innovation to make it into production.
Gemma 3n also uses a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.
https://arxiv.org/pdf/2411.07501v3
https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...
Seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64.
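For anyone who doesn't want to dig through the paper, my (possibly imperfect) reading of LAuReL-LR is that the skip connection gets a learned low-rank correction on top of the usual identity, roughly:

    import torch
    import torch.nn as nn

    class LaurelLR(nn.Module):
        """Sketch of a LAuReL-LR style residual; details may differ from the paper."""

        def __init__(self, d_model: int = 2048, rank: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, rank, bias=False)   # D -> R
            self.up = nn.Linear(rank, d_model, bias=False)     # R -> D
            self.alpha = nn.Parameter(torch.ones(1))           # learned scale on f(x)

        def forward(self, x, f_x):
            # A plain residual would be x + f_x; here the skip path also carries
            # a low-rank learned term. With D=2048 and R=64 that's roughly 256K
            # extra parameters per layer.
            return self.alpha * f_x + x + self.up(self.down(x))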