> for a 1T model you'd need to stream something like 2TB of weights per forward pass
Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. Each individual expert is still quite small (a few MiB each, give or take), but those reads are large enough for NVMe to handle efficiently.
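Rough back-of-envelope to make the point concrete. All of the numbers below (expert count, top-k, the split between shared and expert parameters) are illustrative assumptions, not figures from any real model:

```python
# Weights read per token: dense model vs. sparse MoE.
# Every number here is an assumed, illustrative value.

total_params = 1e12          # hypothetical 1T-parameter model
experts_per_layer = 256      # assumed experts per MoE layer
top_k = 8                    # assumed experts activated per token
expert_fraction = 0.9        # assume ~90% of params sit in expert FFNs
bytes_per_param = 2          # fp16/bf16 weights

expert_params = total_params * expert_fraction
shared_params = total_params * (1 - expert_fraction)

# Dense: every parameter is read for every token.
dense_bytes = total_params * bytes_per_param

# MoE: all shared weights, plus top_k / experts_per_layer of the expert weights.
active_bytes = (shared_params
                + expert_params * top_k / experts_per_layer) * bytes_per_param

print(f"dense read per token:  {dense_bytes / 1e9:.0f} GB")
print(f"sparse read per token: {active_bytes / 1e9:.0f} GB")
```

With these assumed numbers the dense read is ~2 TB per token (matching the quoted figure) while the sparse read is an order of magnitude smaller.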
But across a sequence you still end up loading most of them, since different tokens route to different experts.