Ideally, you'd rearrange batches so that inference steps relying on the same experts get batched together; inferences that would otherwise "hold up" a batch simply wait for that one "long tail" expert to be loaded, then progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
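Roughly what I mean, as a sketch (everything here is hypothetical - names, the idea that each step knows which expert its router picked, all of it):

```python
# Hypothetical expert-aware scheduler: bucket pending steps by the expert
# they need, run buckets whose expert is already in HBM, and stall the rest
# (checkpointed) until that expert's DMA transfer from host DRAM finishes.
from collections import defaultdict

def schedule(pending_steps, resident_experts):
    """pending_steps: list of (step, expert_id) pairs.
    resident_experts: set of expert ids currently in HBM.
    Returns (ready_batches, stalled)."""
    by_expert = defaultdict(list)
    for step, expert_id in pending_steps:
        by_expert[expert_id].append(step)

    ready_batches, stalled = [], []
    for expert_id, steps in by_expert.items():
        if expert_id in resident_experts:
            ready_batches.append((expert_id, steps))  # run now, one batch per expert
        else:
            stalled.append((expert_id, steps))        # checkpoint, wait for the load
    return ready_batches, stalled
```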
I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.
But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM, each amortized over a single token, instead of reads from HBM, which you amortize over all 32 tokens in the batch.
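Back-of-envelope on that gap, with made-up but plausible numbers (a 2 GiB expert, ~3 TB/s HBM, ~50 GB/s host-to-device link - none of these are from the original claim):

```python
# Per-token weight-movement cost: HBM read shared by a 32-token batch
# vs. a host-DRAM transfer that ends up serving a single token.
EXPERT_BYTES = 2 * 2**30   # assumed expert size: 2 GiB
HBM_BW = 3e12              # assumed HBM bandwidth, bytes/s
LINK_BW = 50e9             # assumed host-to-device bandwidth, bytes/s

hbm_per_token = EXPERT_BYTES / HBM_BW / 32   # amortized over 32 tokens
dram_per_token = EXPERT_BYTES / LINK_BW / 1  # amortized over 1 token

print(f"HBM:  {hbm_per_token * 1e6:9.1f} us/token")   # ~22 us
print(f"DRAM: {dram_per_token * 1e6:9.1f} us/token")  # ~43,000 us
print(f"ratio: {dram_per_token / hbm_per_token:.0f}x")  # ~1900x
```

The ratio is just (HBM bandwidth / link bandwidth) x batch size, so with these assumptions the rarely-firing swapped expert costs three orders of magnitude more per token.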