OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (arXiv 2512.24880).
Two key takeaways from the reproduction:
1. Unconstrained Hyper-Connections really do explode: roughly 7x amplification even at the 10M-parameter scale (toy sketch below).
2. I hit a nasty "stream persistence" bug: every tensor had the right shape, but the architecture was functionally broken (a sketch of that failure mode follows as well).
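
To make takeaway 1 concrete, here is a toy sketch of what "unconstrained" means here. This is simplified relative to the paper and the names are my own, not the paper's formulation: the residual stream is widened into n parallel streams, and each layer remixes them with a learnable matrix that has nothing bounding its gain.

```python
import torch
import torch.nn as nn

class UnconstrainedHC(nn.Module):
    """Toy n-stream hyper-connection layer (hypothetical names,
    simplified relative to the paper). Each layer remixes n residual
    streams with a learnable matrix that nothing constrains."""

    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        # Unconstrained mixing weights: no normalization, no manifold
        # projection. Whatever gain training gives them, they keep.
        self.mix = nn.Parameter(
            torch.eye(n_streams) + 0.02 * torch.randn(n_streams, n_streams)
        )
        self.f = nn.Linear(dim, dim)  # stand-in for the block's F(x)

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, dim)
        mixed = torch.einsum("ij,jbtd->ibtd", self.mix, streams)
        update = self.f(mixed[0])           # branch reads from one stream
        return mixed + update.unsqueeze(0)  # and writes back into all of them

if __name__ == "__main__":
    torch.manual_seed(0)
    streams = torch.randn(4, 2, 16, 64)
    with torch.no_grad():
        for i, layer in enumerate(UnconstrainedHC(64) for _ in range(12)):
            out = layer(streams)
            # Per-layer gain: near 1.0 at init, but nothing in the
            # architecture keeps it there once self.mix is trained.
            print(f"layer {i:2d} gain: {(out.norm() / streams.norm()).item():.3f}")
            streams = out
```

The point of the probe is that the per-layer gains multiply across depth, so even a modest drift above 1.0 per layer compounds into the kind of blow-up I saw.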
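
And for takeaway 2, a hypothetical sketch of how a stream persistence bug can pass every shape check. This is an illustration of the failure class, not my actual code: if the streams are created and collapsed inside each block, the n-stream state never crosses a layer boundary.

```python
import torch
import torch.nn as nn

class BrokenStreamBlock(nn.Module):
    """Hypothetical sketch of the failure class: the streams are
    expanded *inside* the block and collapsed before returning, so
    every shape check passes while the n-stream state never survives
    past a single layer."""

    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.expand = nn.Parameter(torch.ones(n_streams) / n_streams)
        self.mix = nn.Parameter(torch.eye(n_streams))
        self.f = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # BUG: the streams are born here...
        streams = torch.einsum("s,btd->sbtd", self.expand, x)
        streams = torch.einsum("ij,jbtd->ibtd", self.mix, streams)
        # ...and die here. The caller only ever sees (batch, seq, dim),
        # so stacking these blocks degenerates to a plain residual net.
        return streams.sum(0) + self.f(x)

# The fix is an interface change, not a shape change: expand to
# (n_streams, batch, seq, dim) once at the embedding, pass that tensor
# between blocks, and collapse once before the output head.
```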
This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.
How do you know that "GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: x+F(x)"?
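
For context, the quoted line refers to the residual connection that standard transformer blocks are built around. For open-weights models like Llama this is directly checkable in the released code; for the closed models it is an inference from their published descriptions, not their source. A generic pre-norm sketch of the pattern (illustrative, not any specific model's implementation):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Generic pre-norm transformer block. Both sub-layers are wired
    as x + F(x); models differ in what F is, not in the residual
    wiring around it."""

    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x + F(x)
        x = x + self.mlp(self.norm2(x))                    # x + F(x)
        return x
```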