
zozbot234 · today at 2:30 PM

> Well, the actual inference providers put each expert on its own single GPU.

Experts are usually chosen on a per-layer basis, not just per token, so I'd think this requires a lot of GPUs to be worthwhile. You could do it on a single physical GPU by switching expert-layer mixes in round-robin fashion, completing the batch for one expert-layer mix before swapping in the next (essentially a refined version of expert offloading). But still, not easy.
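The round-robin idea could be sketched roughly as follows. This is a hypothetical scheduler (all names are made up, not from any real inference framework): routed tokens are queued per (layer, expert) pair, and a single GPU loads one expert's weights at a time, draining that queue's whole batch before swapping weights.

```python
from collections import defaultdict, deque
from itertools import cycle

def schedule_round_robin(routed_tokens):
    """Sketch of round-robin expert-layer scheduling on one GPU.

    routed_tokens: iterable of (layer, expert, token_id) routing decisions.
    Returns the order of weight loads and the token batch processed
    under each load -- one swap per expert-layer mix, not per token.
    """
    # Group tokens by which expert-layer mix they need.
    queues = defaultdict(deque)
    for layer, expert, tok in routed_tokens:
        queues[(layer, expert)].append(tok)

    load_order = []  # sequence of (layer, expert) weight swaps
    batches = []     # ((layer, expert), [tokens]) in execution order
    for key in cycle(sorted(queues)):  # round-robin over expert-layer mixes
        if not any(queues.values()):   # all queues drained: done
            break
        if queues[key]:
            batch = list(queues[key])  # process the full batch for this mix
            queues[key].clear()
            load_order.append(key)
            batches.append((key, batch))
    return load_order, batches
```

The point of batching per expert-layer mix is to amortize the cost of swapping expert weights in and out of GPU memory, which is the expensive step that makes naive per-token expert offloading slow.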