Hacker News

reitzensteinm · today at 12:26 AM

I think part of the issue is that in production deployments, batch sizes are high enough that you'll constantly be paging in those long-tail experts.

Unless you're handling that in some kind of fancy way, you'll be holding up the whole batch while waiting on host memory, which will kill your throughput.
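
Rough back-of-envelope of why the long tail gets hit (assumed Mixtral-ish numbers, not anyone's real config): with top-k routing over E experts and batch size B, a uniform-routing approximation says one MoE layer touches about E * (1 - (1 - k/E)**B) distinct experts per step.

    # Assumed config: 64 experts, top-2 routing, uniform routing approximation.
    def expected_distinct_experts(num_experts: int, top_k: int, batch_size: int) -> float:
        p_not_picked_by_one_token = 1.0 - top_k / num_experts
        p_never_picked = p_not_picked_by_one_token ** batch_size
        return num_experts * (1.0 - p_never_picked)

    for batch in (1, 4, 16, 64, 256):
        hit = expected_distinct_experts(num_experts=64, top_k=2, batch_size=batch)
        print(f"batch={batch:4d}  ~{hit:5.1f} / 64 experts active per layer step")

At batch 64 that's already ~55 of 64 experts per layer per step, so anything living in host memory gets pulled in on essentially every step.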

It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.


Replies

zozbot234 · today at 12:32 AM

Ideally, you should rearrange batches so that inference steps that rely on the same experts get batched together, then inferences that would "hold up" a batch simply wait for that one "long tail" expert to be loaded, whereupon they can progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
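
Quick sketch of that scheduling idea in Python (the names and data structures are just illustrative, not an existing runtime): bucket pending steps by the expert set the router picked, run the buckets whose experts are already resident, and defer only the steps that need a cold expert until its prefetch finishes.

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Step:
        request_id: int
        experts: frozenset          # experts the router selected for this step

    @dataclass
    class ExpertScheduler:
        resident: set               # experts currently in GPU memory
        pending: list = field(default_factory=list)

        def submit(self, step):
            self.pending.append(step)

        def schedule(self):
            """Return (runnable step groups, experts to prefetch from host)."""
            buckets = defaultdict(list)
            for step in self.pending:
                buckets[step.experts].append(step)

            runnable, to_prefetch, deferred = [], set(), []
            for experts, steps in buckets.items():
                if experts <= self.resident:
                    runnable.append(steps)                  # run together now
                else:
                    to_prefetch |= experts - self.resident  # kick off async load
                    deferred.extend(steps)                  # only these steps wait
            self.pending = deferred
            return runnable, to_prefetch

    # Example: only the step routed to the cold expert 7 waits for the load.
    sched = ExpertScheduler(resident={0, 1, 2, 3})
    sched.submit(Step(0, frozenset({1, 2})))
    sched.submit(Step(1, frozenset({7})))
    ready, prefetch = sched.schedule()    # ready: one group, prefetch: {7}

The checkpointing part would live in whatever executes the runnable groups; the scheduler only decides which steps wait on which expert.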
