Hacker News

moffkalast, today at 3:58 PM

42B active params, sliding window attention. There's your tradeoff.


Replies

vlovich123, today at 4:04 PM

The sliding window attention is for the draft model, not the main one. It's 42B active params because it's a sparse MoE, which is a common technique for the larger models so they don't get bottlenecked by memory bandwidth.
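A rough back-of-the-envelope sketch of the memory bandwidth point, assuming single-stream decoding where reading the weights dominates. Only the 42B active figure comes from the thread; the total parameter count, the bandwidth, and the 8-bit weight size are hypothetical placeholders, and `decode_tokens_per_second` is an illustrative helper, not anything from a real library.

```python
def decode_tokens_per_second(params_read: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if every token must stream
    `params_read` weights from memory once (ignores KV cache and compute)."""
    bytes_read_per_token = params_read * bytes_per_param
    return bandwidth_bytes_per_s / bytes_read_per_token


ACTIVE_PARAMS = 42e9        # from the thread: params used per token
TOTAL_PARAMS = 400e9        # hypothetical: total size is not stated in the thread
BYTES_PER_PARAM = 1.0       # assumption: 8-bit quantized weights
BANDWIDTH = 1e12            # assumption: 1 TB/s of memory bandwidth

# Sparse MoE: only the routed experts' weights are read for each token.
moe_bound = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

# Dense model of the same total size: every weight is read for each token.
dense_bound = decode_tokens_per_second(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

print(f"MoE bound:   {moe_bound:.1f} tokens/s")
print(f"Dense bound: {dense_bound:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with the number of params touched per token, which is why the active count matters more than the total for decode speed. The full set of weights still has to fit in memory either way.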

bearjaws, today at 4:19 PM

Given how "smart" some of the 26B dense models are now, I would not be surprised to see a strong 40B MoE.