> I get the feeling that it was trained very differently from the other models
It's actually based on the DeepSeek architecture, just with larger experts, if I recall correctly.
It was notably trained with the Muon optimizer, for what it's worth, but I don't know how much can be attributed to that alone.
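For anyone curious what makes Muon different: its core idea is to take the momentum buffer for each weight matrix and orthogonalize it (push all its singular values toward 1) before applying the update, rather than using raw momentum like SGD or per-coordinate scaling like Adam. Here's a rough sketch, with the caveat that this is illustrative, not the actual training recipe: the real Muon uses a tuned quintic Newton-Schulz polynomial for speed, while this sketch uses the classical cubic iteration, and `muon_step`, its `lr`/`beta` values, and the shape-scaling heuristic are all my own simplified stand-ins.

```python
import numpy as np

def orthogonalize(G, steps=15):
    """Approximate the nearest orthogonal matrix to G via Newton-Schulz
    iteration (classical cubic variant; Muon itself uses a tuned quintic)."""
    X = G / np.linalg.norm(G)        # Frobenius-normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                   # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X    # each singular value s -> 1.5*s - 0.5*s**3
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update: accumulate momentum,
    orthogonalize it, then apply a shape-scaled step (values illustrative)."""
    momentum = beta * momentum + grad
    update = orthogonalize(momentum)
    scale = max(W.shape[0] / W.shape[1], 1.0) ** 0.5
    return W - lr * scale * update, momentum
```

The intuition often given is that orthogonalizing the update equalizes the "rare direction" and "dominant direction" components of the gradient, which plain momentum would weight very unevenly.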
As far as I'm aware, they all are. There are only five important foundation models in play -- Gemini, GPT, Grok (X.ai), Claude, and DeepSeek. (edit: forgot Claude)
Everything from China is downstream of DeepSeek, which some have argued is itself basically a protégé of ChatGPT.