Multi-head Latent Attention (MLA) + DeepSeekMoE, plus an auxiliary-loss-free load-balancing strategy and a multi-token-prediction objective, to train and run inference on huge MoE models efficiently.
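The auxiliary-loss-free balancing idea is roughly: keep a per-expert bias that only influences which experts get selected, not the gating weights, and nudge it toward underloaded experts instead of adding a balancing loss term. A minimal NumPy sketch of that idea (function names and the update rule's exact form are my assumptions, not the paper's code):

```python
import numpy as np

def route_tokens(scores, bias, k=2):
    # scores: (tokens, experts) raw router affinities.
    # bias only affects top-k SELECTION; gating weights would still
    # be computed from the raw scores (the "aux-loss-free" trick).
    biased = scores + bias
    return np.argsort(-biased, axis=-1)[:, :k]

def update_bias(bias, expert_counts, target_load, gamma=0.001):
    # Nudge bias up for underloaded experts, down for overloaded ones,
    # by a fixed step gamma (a hypothetical update schedule).
    return bias + gamma * np.sign(target_load - expert_counts)
```

Because the bias never enters the loss, balancing pressure does not distort the gradient signal the way an auxiliary balancing loss can.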
Have you seen the Manifold-Constrained Hyper-Connections (mHC) paper DeepSeek released a few days ago? It projects the residual-connection space onto a constrained manifold, keeping the identity-mapping property while enabling richer internal connectivity, which removes a major obstacle to widening residual streams.
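To illustrate the general idea (this is a toy construction of mine, not the paper's actual parameterization): if the mixing matrix across residual streams is projected onto a manifold where each row sums to 1, a constant signal passes through unchanged, which is one way to retain an identity-mapping property while still allowing learned cross-stream mixing.

```python
import numpy as np

def project_row_sum_one(W):
    # Toy projection onto the affine manifold {W : each row sums to 1}.
    # With row sums fixed at 1, a constant vector is a fixed point of
    # x -> W @ x, preserving an identity-mapping property.
    n = W.shape[1]
    return W + (1.0 - W.sum(axis=1, keepdims=True)) / n
```

Any unconstrained learned matrix can be mapped through such a projection on the forward pass, so the optimizer explores freely while the effective connectivity stays on the manifold.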
They have also released a lot of training tricks and innovations around optimizing training and inference.
As to other industries:
"China leads research in 90% of crucial technologies — a dramatic shift this century" [1]
And here is [2], "China Is Rapidly Becoming a Leading Innovator in Advanced Industries", a large report on where they lead and how.
1. https://www.nature.com/articles/d41586-025-04048-7
2. https://itif.org/publications/2024/09/16/china-is-rapidly-be...