I’ve been wondering for a while: Why isn’t this architecture more common in other LLMs? The context efficiency is amazing, after all - doesn’t that translate to a lot of money at scale?
I think the biggest benefit is bandwidth more so than efficiency. This gives you multiple streams to mux and a means to control their mixing.
The biggest innovation, I think, may have been accidental. The doubly stochastic matrix implements conservation on the signal stream.
Treating the signal like the information it is, as we do in any other domain, is crucial for maintaining its coherence. We don't allow a network router to generate more packets than it receives for the same reason.
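To make the conservation point concrete, here's a minimal sketch of my own (not taken from the paper), using Sinkhorn normalization as one assumed way to build an approximately doubly stochastic mixing matrix. Because every column sums to 1, mixing the residual streams with it leaves the total signal summed across streams unchanged:

```python
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project a square matrix of logits onto (approximately) the set of
    doubly stochastic matrices by alternating row/column normalization."""
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # columns sum to 1
    return m

k, d = 4, 8                        # k residual streams of width d (illustrative sizes)
streams = torch.randn(k, d)        # hypothetical multi-stream residual state

mix = sinkhorn(torch.randn(k, k))  # doubly stochastic mixing matrix
mixed = mix @ streams              # route/mix the streams

# Columns summing to 1 mean the signal totalled over streams is conserved:
print(torch.allclose(streams.sum(dim=0), mixed.sum(dim=0), atol=1e-4))  # True
```

An arbitrary learned mixing matrix wouldn't give you that guarantee; nothing would stop it from amplifying or draining the residual stream layer after layer.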
https://arxiv.org/abs/2512.24880 was published less than two weeks ago, which should explain why it's not more common yet. And it's not that amazing either. It's a slight quality improvement for a slight increase in cost. It's not even clear to me whether it pays for itself.
It's an incremental improvement, not really a revolutionary step.
That being said, I think one could adapt an existing model to add mHC by initializing the routing matrix to the regular residual connection and then post-training the hyper-connection matrices. This would let you continue training more efficiently on existing models.
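Here's roughly what I mean, as a hypothetical sketch rather than the paper's actual formulation; `HyperConnectionWrapper`, `h_res`, `h_in`, and `h_out` are made-up names, and the real mHC routing is more involved. The point is just that at initialization the wrapper reproduces the plain residual `x + block(x)`, so the extra matrices can be post-trained from an existing checkpoint:

```python
import torch
import torch.nn as nn

class HyperConnectionWrapper(nn.Module):
    """Simplified hyper-connection-style wrapper around an existing block,
    initialized so it is exactly the standard residual connection."""
    def __init__(self, block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.block = block
        self.h_res = nn.Parameter(torch.eye(n_streams))                  # stream -> stream routing (identity at init)
        self.h_in = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # streams -> block input (uniform average)
        self.h_out = nn.Parameter(torch.ones(n_streams))                 # block output -> each stream (plain add)

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, dim)
        block_in = torch.einsum("n,nbd->bd", self.h_in, streams)
        block_out = self.block(block_in)                                 # (batch, dim)
        routed = torch.einsum("nm,mbd->nbd", self.h_res, streams)
        return routed + self.h_out[:, None, None] * block_out

# At init, every output stream equals the ordinary residual x + block(x):
dim, batch, n = 16, 2, 4
layer = HyperConnectionWrapper(nn.Linear(dim, dim), n_streams=n)
x = torch.randn(batch, dim)
streams = x.expand(n, batch, dim)   # replicate the hidden state into n streams
out = layer(streams)
print(torch.allclose(out[0], x + layer.block(x)))  # True
```

With that starting point the routing parameters begin as a no-op, so continued pretraining or fine-tuning only has to learn how far to deviate from the standard residual, rather than relearning the whole thing from scratch.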