ac29 • yesterday at 6:02 PM • 1 reply

Modern GPUs aren't optimized for MoEs though?

The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token; MoE models typically read only about a tenth (though sparsity levels vary).
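To put rough numbers on that tradeoff, here is a back-of-envelope sketch. It assumes single-stream decoding is limited purely by how fast the weights can be read from memory. Every figure in it is made up for illustration: the parameter counts, the 4x size ratio between the MoE and the dense model, the 2 bytes per parameter, and the 1 TB/s of bandwidth are not specs of the Mistral model or of any real GPU.

```python
def weight_footprint_bytes(total_params, bytes_per_param):
    """Memory needed to hold all weights; this drives how many GPUs you need."""
    return total_params * bytes_per_param


def decode_tokens_per_second(total_params, active_fraction,
                             bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed if reading weights is the only cost.

    active_fraction is the share of weights read per token:
    1.0 for a dense model, roughly 0.1 for a sparse MoE.
    """
    bytes_read_per_token = total_params * active_fraction * bytes_per_param
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
BYTES_PER_PARAM = 2        # assumes 16-bit weights
BANDWIDTH = 1e12           # assumes 1 TB/s of memory bandwidth

DENSE_PARAMS = 100e9       # hypothetical dense model
MOE_PARAMS = 400e9         # hypothetical "much larger" MoE
MOE_ACTIVE_FRACTION = 0.1  # "about a tenth" of weights read per token

dense_bytes = weight_footprint_bytes(DENSE_PARAMS, BYTES_PER_PARAM)
moe_bytes = weight_footprint_bytes(MOE_PARAMS, BYTES_PER_PARAM)

dense_tps = decode_tokens_per_second(
    DENSE_PARAMS, 1.0, BYTES_PER_PARAM, BANDWIDTH)
moe_tps = decode_tokens_per_second(
    MOE_PARAMS, MOE_ACTIVE_FRACTION, BYTES_PER_PARAM, BANDWIDTH)

print(f"dense: {dense_bytes / 1e9:.0f} GB of weights, {dense_tps:.1f} tok/s bound")
print(f"MoE:   {moe_bytes / 1e9:.0f} GB of weights, {moe_tps:.1f} tok/s bound")
```

With these made-up inputs the dense model needs a quarter of the memory but has a lower speed ceiling, because it reads more bytes per token than the MoE does. Real throughput also depends on batching, KV cache reads, and interconnect, none of which this sketch models.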

Replies

schipperai • yesterday at 11:34 PM

Thanks, makes sense. I meant Blackwell is explicitly optimized for MoEs.
