himata4113 • yesterday at 5:37 PM • 1 reply

MoE experts were likely trained independently / in a sparse format. Training anything beyond 2T on typical systems would be infuriatingly slow. You could do 4T on Nvidia's room-scale solution, but for a reasonable training speed / batch size it caps out around 3T.

Replies

sosodev • yesterday at 5:48 PM

Do you have any resources to share regarding independent expert training? I was under the impression that it's not feasible.
