rahimnathwani • yesterday at 12:34 AM • 1 reply

Even with MoE you still need enough memory to load all experts. For each token, only 8 experts (out of 256) are activated, but which experts are chosen changes dynamically based on the input. This means that if the experts don't all fit in memory, you'll be constantly loading and unloading them from disk.
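To make the point concrete, here is a rough sketch of top-k routing using the 8-of-256 figures above. The router here is hypothetical: it scores experts with random numbers, whereas a real router is a learned function of the token, so the exact counts are illustrative only. What it shows is that although each token touches only 8 experts, the set of experts touched across many tokens grows quickly toward all of them.

```python
import random

NUM_EXPERTS = 256  # total experts per MoE layer (figure from the comment)
TOP_K = 8          # experts activated per token (figure from the comment)


def route_token(rng):
    """Hypothetical router: score every expert and keep the top-k indices.

    Assumption: scores are uniform random. A real router computes scores
    from the token's hidden state, so real usage is not uniform.
    """
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    ranked = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)
    return set(ranked[:TOP_K])


def experts_touched(num_tokens, seed=0):
    """Return the set of distinct experts activated across num_tokens tokens."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        touched |= route_token(rng)
    return touched


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n} tokens -> {len(experts_touched(n))} distinct experts")
```

Under this uniform-routing assumption, the expected fraction of experts touched after n tokens is 1 - (1 - 8/256)^n, so any expert that is not resident in memory will be needed again soon.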

MoEs are great for distributed deployments, because you can maintain a distribution of experts that matches your workload, and you can try to saturate each expert and thereby saturate each node.

Replies

zozbot234 • yesterday at 12:44 AM

Loading and unloading data from disk is highly preferable to sending the same amount of data over a bottlenecked Thunderbolt 5 connection.
