Hacker News

rnrn — today at 6:01 PM

How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?

The set of activated experts is only known after routing, at which point you need those weights immediately; performance will be very poor if they have to come across PCIe.


Replies

zozbot234 — today at 6:17 PM

Once your model is large enough you'll have to eat the offload cost for something, and it might as well be the part of the model whose VRAM footprint is mostly unused on any given token. For current MoE models, the inactive experts arguably fit that description best. Of course, it may turn out that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU transfer cost for the active weights and computing on GPU; that's the approach llama.cpp takes when experts are kept in system RAM.
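To make the prefetch problem concrete, here is a toy sketch (not llama.cpp code; all names, sizes, and the routing function are made up for illustration). The point is that the set of expert weights a batch needs is a data-dependent output of the router, so it cannot be known, and therefore cannot be prefetched over PCIe, before routing runs:

```python
# Toy top-k MoE routing: the active expert set is only known after
# the router scores each token, which is why expert weights can't be
# staged into fast memory ahead of time. Constants are illustrative.
import random

NUM_EXPERTS = 8  # hypothetical expert count
TOP_K = 2        # experts activated per token

def route(token_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(token_scores)),
                  key=lambda e: token_scores[e], reverse=True)[:k]

def active_experts(batch_scores):
    """Union of experts needed by any token in the batch.

    This set is computed *from* the router's outputs, so the weights it
    names are demanded immediately after routing -- too late to overlap
    a PCIe transfer with anything useful.
    """
    needed = set()
    for scores in batch_scores:
        needed.update(route(scores))
    return needed

random.seed(0)
batch = [[random.random() for _ in range(NUM_EXPERTS)] for _ in range(4)]
print(sorted(active_experts(batch)))
```

With larger batches the union of active experts tends toward all of them, which is one reason keeping the expert FFNs in system RAM and running them on CPU (rather than shipping the active subset to the GPU per batch) can come out ahead.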