42B active params, sliding window attention. There's your tradeoff.

moffkalast • today at 3:58 PM • 2 replies

Replies

The sliding window attention is for the draft model, not the main one. The 42B figure is active params because it's a sparse MoE, which is a common technique for larger models to avoid getting bottlenecked by memory bandwidth.
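A back-of-the-envelope sketch of the memory bandwidth point, assuming decode speed is bounded purely by how many weight bytes must be read per token. Only the 42B active figure comes from the thread; the 400B dense comparison, the 8-bit weights, and the 1 TB/s bandwidth are hypothetical numbers chosen for illustration.

```python
def decode_tokens_per_sec(params_read_b: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed when weight reads dominate.

    params_read_b: billions of parameters touched per token
                   (active params for a sparse MoE, all params for dense).
    """
    gb_read_per_token = params_read_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical hardware: 1000 GB/s memory bandwidth, 1 byte per param (8-bit).
moe_speed = decode_tokens_per_sec(42, 1.0, 1000)     # 42B active (from the thread)
dense_speed = decode_tokens_per_sec(400, 1.0, 1000)  # assumed 400B dense model

print(f"MoE, 42B active: ~{moe_speed:.1f} tok/s upper bound")
print(f"Dense, 400B:     ~{dense_speed:.1f} tok/s upper bound")
```

This ignores KV cache reads, batching, and compute limits, so treat it as a ceiling on a simplified model rather than a benchmark.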


bearjaws • today at 4:19 PM

Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.


Replies