Hacker News

Reproducing DeepSeek's MHC: When Residual Connections Explode

96 points | by taykolasinski yesterday at 1:57 PM | 30 comments

Comments

cpldcpu yesterday at 3:49 PM

May be worth pointing out that this is not the first residual-connection innovation to make it into production.

Gemma 3n also uses a low-rank projection of the residual stream called LAuReL. Google did not publicize this much; I noticed it when poking around in the model file.

https://arxiv.org/pdf/2411.07501v3

https://old.reddit.com/r/LocalLLaMA/comments/1kuy45r/gemma_3...

It seems to be what they call LAuReL-LR in the paper, with D=2048 and R=64. A rough sketch of that form is below.
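
For the curious, here is a minimal PyTorch sketch of LAuReL-LR as I read the paper: the plain skip x is augmented with a learned low-rank term. The module and weight names are mine, not Google's, and the dimensions just match what I saw in the model file:

    import torch
    import torch.nn as nn

    class LaurelLR(nn.Module):
        # LAuReL-LR as I read the paper: replace the plain skip x with
        # x + B(A(x)), where A: D->R and B: R->D are learned low-rank
        # maps. Names and init are my guesses, not Gemma's actual code.
        def __init__(self, d_model=2048, rank=64):
            super().__init__()
            self.down = nn.Linear(d_model, rank, bias=False)  # A: D -> R
            self.up = nn.Linear(rank, d_model, bias=False)    # B: R -> D
            nn.init.zeros_(self.up.weight)  # start as a vanilla residual

        def forward(self, x, layer_out):
            # a vanilla residual connection would be: layer_out + x
            return layer_out + x + self.up(self.down(x))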

taykolasinski yesterday at 2:09 PM

OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

1. Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).

2. I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
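
For a sense of what "amplification" means here: with n parallel residual streams, each layer mixes the streams with an n x n matrix H_res, so the identity path through the network is the product of all the H_res matrices. A simplified sketch of the gain measurement (illustrative, not my exact code):

    import torch

    def residual_gain(h_res_mats):
        # Spectral norm of the composed skip path. For a plain ResNet
        # every H_res is the identity and the gain stays at exactly 1;
        # unconstrained Hyper-Connections let it drift (I saw ~7x).
        prod = torch.eye(h_res_mats[0].shape[0])
        for h in h_res_mats:  # one n x n mixing matrix per layer
            prod = h @ prod
        return torch.linalg.matrix_norm(prod, ord=2).item()

    # e.g. 12 layers of slightly-off-identity mixing over n=4 streams
    mats = [torch.eye(4) + 0.08 * torch.randn(4, 4) for _ in range(12)]
    print(residual_gain(mats))  # drifts away from 1 as layers compose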

AlexCoventry yesterday at 11:01 PM

What's the advantage of having multiple channels with separate residual connections? Why not just concatenate those channels, and do residual connections on the concatenated channel?

in-silico yesterday at 7:16 PM

Why can't you just leave H_res as the identity matrix (or just not use it at all)? In that case, the model is basically a ResNet again and you don't need to worry about exploding/vanishing gradients from H_res.

I would think that H_post and H_pre could cover the lost expressiveness.
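
For context, here's my mental model of the update, with shapes as I understand the Hyper-Connections formulation (a sketch, not the paper's code):

    import torch

    n, d, seq = 4, 256, 8                  # n parallel streams of width d
    x = torch.randn(n, seq, d)             # the n-stream residual state

    h_pre = torch.randn(1, n)              # streams -> single layer input
    h_post = torch.randn(n, 1)             # layer output -> streams
    h_res = torch.eye(n)                   # stream mixing on the skip path

    f = torch.nn.Linear(d, d)              # stand-in for the block itself

    layer_in = torch.einsum('on,nsd->osd', h_pre, x)   # (1, seq, d)
    layer_out = f(layer_in)
    x_next = (torch.einsum('nm,msd->nsd', h_res, x)    # skip path
              + torch.einsum('no,osd->nsd', h_post, layer_out))

    # With h_res = eye(n), the skip path is exactly a per-stream ResNet,
    # and all the remaining expressiveness lives in h_pre/h_post.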

john-titor yesterday at 10:06 PM

Great write-up. It's been a while since I've had the pleasure of reading a straightforward blog post about ML tricks that feel genuinely applicable to many use cases.

theschwa yesterday at 4:57 PM

Between the clear writing and the diagrams, this was a great write-up. I had actually skipped reading up on mHC because it sounded like it would take some time to grok, but this made it immediately approachable. I hope you do more write-ups like this in the future.

Scene_Cast2 yesterday at 4:22 PM

I implemented this for a toy 8M-parameter ViT-style model and got neutral results. This is just an anecdote and isn't representative; I think mHC will help at larger parameter counts and larger token counts.

solarkraft yesterday at 2:48 PM

I've been wondering for a while: why isn't this architecture more common in other LLMs? The context efficiency is amazing, after all; doesn't that translate to a lot of money at scale?

sbondaryev yesterday at 3:47 PM

Nice visualization of the residual connections. Is the animated SVG manually created or programmatically generated? What tools did you use?
