Hacker News

schipperai · yesterday at 5:33 PM · 1 reply

With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?


Replies

ac29 · yesterday at 6:02 PM

Modern GPUs aren't optimized for MoEs though?

The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token; MoE models typically read only about a tenth (though sparsity levels vary).
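To make the "reads all its weights every token" point concrete, here is a back-of-the-envelope sketch. Single-stream decoding is typically memory-bandwidth bound, so a rough speed ceiling is GPU memory bandwidth divided by the bytes of active weights per token. The model sizes, the 8-bit quantization, and the bandwidth figure below are illustrative assumptions, not measurements of any specific model.

```python
GB = 1e9

def decode_tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on decode speed: every generated token must
    stream the active weights from VRAM, so the ceiling is
    bandwidth / active weight bytes."""
    active_bytes = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * GB / active_bytes

BW = 3350  # assumed HBM bandwidth in GB/s (roughly an H100 SXM spec figure)

# Hypothetical dense ~120B model at 8 bits: all weights read per token.
dense = decode_tokens_per_sec(120, 1, BW)

# Hypothetical MoE, same ~120B total but only ~12B active per token
# (about a tenth, matching the sparsity level mentioned above).
moe = decode_tokens_per_sec(12, 1, BW)

print(f"dense: ~{dense:.0f} tok/s ceiling, MoE: ~{moe:.0f} tok/s ceiling")
```

Under these assumptions the MoE's decode ceiling is about 10x higher, which is the slowdown the comment describes; the dense model's compensating advantage is quality per byte of VRAM, not speed.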