Hacker News

logicprog, yesterday at 5:33 PM

DSv4 is nearly in the 2T range, but yes, you're generally right.


Replies

himata4113, yesterday at 5:37 PM

MoE experts were likely trained independently / in a sparse format. Training anything beyond 2T parameters on typical systems would be infuriatingly slow. You could do 4T on NVIDIA's room-scale solution, but for a reasonable training speed and batch size it caps out around 3T.
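For a rough sense of why parameter counts in the 2T to 4T range strain hardware, here is a back-of-envelope sketch of the memory needed just for model states. It is not from the comment above and rests on stated assumptions: mixed-precision Adam at 16 bytes per parameter (bf16 weights and gradients, fp32 master weights and two fp32 Adam moments), a hypothetical 80 GB accelerator, and total (not active) MoE parameters. Activations, communication buffers, and any sharding overhead are ignored, so the device counts are lower bounds only.

```python
# Back-of-envelope memory estimate for model states during training.
# ASSUMPTION: mixed-precision Adam, 16 bytes per parameter:
#   bf16 weights (2) + bf16 grads (2) + fp32 master weights (4)
#   + fp32 Adam first moment (4) + fp32 Adam second moment (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def model_state_terabytes(params: int) -> float:
    """Memory for weights, gradients, and optimizer state, in TB (10**12 bytes)."""
    return params * BYTES_PER_PARAM / 1e12


def min_devices(params: int, device_gb: int = 80) -> int:
    """Lower bound on devices needed to hold model states alone.

    ASSUMPTION: device_gb is a hypothetical per-device memory size;
    activations and buffers are not counted.
    """
    total_bytes = params * BYTES_PER_PARAM
    per_device_bytes = device_gb * 10**9
    return -(-total_bytes // per_device_bytes)  # ceiling division on integers


if __name__ == "__main__":
    for trillions in (2, 3, 4):
        params = trillions * 10**12
        print(
            f"{trillions}T params: "
            f"{model_state_terabytes(params):.0f} TB of model state, "
            f"at least {min_devices(params)} devices at 80 GB each"
        )
```

Because an MoE model still has to store every expert, the estimate uses the total parameter count even though only a fraction of the experts is active per token.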