42B active params, sliding window attention. There's your tradeoff.
Sliding window is for the draft model, not the main one. The 42B figure is active params because it's a sparse MoE, which is a common technique for keeping larger models from getting bottlenecked by memory bandwidth.
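To put rough numbers on the memory bandwidth point, here's a back-of-the-envelope sketch. It assumes single-stream decoding where every active weight is read from memory once per token, and it ignores KV cache reads, routing overhead, and batching. Only the 42B active figure comes from this thread; the total parameter count, weight precision, and bandwidth are made-up placeholders.

```python
def decode_tokens_per_second(params_read: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if weight reads are the only cost."""
    bytes_per_token = params_read * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 42e9               # from the thread: active params per token
HYPOTHETICAL_TOTAL_PARAMS = 400e9  # assumption: total size is not stated here
BYTES_PER_PARAM = 1.0              # assumption: 8-bit weights
BANDWIDTH = 1e12                   # assumption: 1 TB/s of memory bandwidth

# Sparse MoE: only the active experts' weights are read for each token.
sparse_tps = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

# Dense model of the same total size: every weight is read for each token.
dense_tps = decode_tokens_per_second(HYPOTHETICAL_TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

print(f"sparse MoE bound: {sparse_tps:.1f} tok/s")
print(f"dense bound:      {dense_tps:.1f} tok/s")
```

Under these assumptions the ceiling scales with active params rather than total params, which is the whole appeal. You still need enough memory to hold all the experts, so the saving is in bandwidth, not capacity.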
Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.