It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps on certain benchmarks, but when you dig in, those gains are almost always paired with a massive increase in token usage (ahem, Google Deep Think, ahem). For AI to truly be transformational it needs to solve the electricity problem.
And not just token usage but expensive token usage; when it comes to tokens/joule, not all tokens are equal. Efficient use of MoE architectures has a real impact on both tokens/joule and tokens/sec, as the rough sketch below shows.
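To make that concrete, here's a back-of-envelope sketch using the standard approximation of ~2 FLOPs per active parameter per token for a transformer forward pass. The parameter counts are illustrative assumptions (loosely inspired by real dense and MoE model sizes), not measurements; the point is just that an MoE model only activates a fraction of its weights per token, which is where the tokens/joule and tokens/sec win comes from:

```python
# Back-of-envelope compute per generated token.
# Approximation: forward pass costs ~2 FLOPs per *active* parameter.
# All parameter counts below are illustrative assumptions.

FLOPS_PER_ACTIVE_PARAM = 2  # one multiply + one accumulate per weight

models = {
    "dense-70B":       {"total_params": 70e9,  "active_params": 70e9},
    # MoE: only a few experts fire per token, so active << total
    "moe-large":       {"total_params": 141e9, "active_params": 39e9},
    "moe-sparse":      {"total_params": 47e9,  "active_params": 13e9},
}

for name, m in models.items():
    flops_per_token = FLOPS_PER_ACTIVE_PARAM * m["active_params"]
    active_frac = m["active_params"] / m["total_params"]
    print(f"{name:12s} ~{flops_per_token / 1e9:5.0f} GFLOPs/token "
          f"({active_frac:.0%} of params active)")
```

Joules per token scale roughly with FLOPs per token divided by the hardware's FLOPs/joule, so (memory traffic aside) a model that activates ~25% of its weights per token burns roughly a quarter of the energy per token of a dense model the same size.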