Hacker News

vlovich123 today at 4:04 PM

The sliding window is for the draft model, not for the main one. The 42B is the active param count because it’s a sparse MoE, which is a common technique for larger models to avoid getting bottlenecked by memory bandwidth.
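
To make the active vs. total distinction concrete, here's a back-of-the-envelope sketch. The function name and every number in it are made up for illustration and are not this model's real config. It also assumes all experts are the same size and that the non-expert weights (attention, embeddings, router) are used for every token.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified sparse MoE.

    Assumption: every expert has the same size, and the shared
    (non-expert) weights are touched for every token.
    """
    # Total: everything that has to sit in memory.
    total = shared_params + num_experts * params_per_expert
    # Active: what a single token actually routes through.
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical config, NOT the real model's numbers.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=500_000_000,
    num_experts=64,
    experts_per_token=4,
)
print(f"total:  {total / 1e9:.0f}B")
print(f"active: {active / 1e9:.0f}B")
```

With a single sequence being decoded, the weights read per token scale roughly with the active count rather than the total, which is where the memory bandwidth saving comes from.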


Replies

moffkalast today at 4:27 PM

Seems to be for both according to the spec [0], maybe it's wrong though.

128 sounds really tiny, I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
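
For a sense of how small 128 would be if it's counted in tokens, here's a minimal sketch of a causal sliding window mask. It assumes the usual definition (each position sees itself and the previous window - 1 positions); whether the spec means tokens or some kind of blocks isn't something this settles. The helper name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: causal attention where each position sees itself and at
    most the previous `window - 1` positions.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Tiny case so the pattern is visible.
mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# With window=128, a late position still only sees 128 keys.
visible = sum(sliding_window_mask(seq_len=1000, window=128)[999])
print(visible)
```

So under that reading, anything further back than 128 positions is invisible to a single layer, and longer-range context would have to come from stacking layers or from other layers that use full attention.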
