Hacker News

stefan_ · yesterday at 6:33 PM

The primary (non-malicious, non-stupid) explanation given here is batching. But I think if you looked at large-scale inference you would find that the batch sizes being run on any given rig are fairly static: for any given part of the model run individually, there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism.
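Rough back-of-the-envelope sketch of why the batch size on a given rig ends up pinned (all model dimensions below are made up, roughly 70B-class, not any provider's actual config): KV-cache memory grows linearly with batch size, so past the sweet spot you simply run out of HBM.

    # Hypothetical dimensions; the point is the linear growth, not the exact numbers.
    def kv_cache_gib(batch_size: int,
                     seq_len: int = 8192,
                     n_layers: int = 80,
                     n_kv_heads: int = 8,
                     head_dim: int = 128,
                     bytes_per_elem: int = 2) -> float:
        """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch."""
        elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
        return elems * bytes_per_elem / 2**30

    for b in (1, 8, 32, 64, 128):
        print(f"batch {b:>3}: ~{kv_cache_gib(b):6.1f} GiB of KV cache")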

I think the more likely explanation again lies with the extremely heterogeneous compute platforms they run on.
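As a toy illustration of why that matters (pure NumPy, nothing to do with anyone's actual serving stack): the same reduction accumulated in two different orders, as different GPUs or kernels would do it, generally isn't bit-identical in float32, and that can be enough to flip a token at a sampling boundary.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2**20).astype(np.float32)

    # One accumulation order: NumPy's pairwise sum over the flat array.
    full = x.sum(dtype=np.float32)
    # Another order: sum in 1024-element chunks, then sum the partials
    # (roughly what a tiled kernel or a different GPU might do).
    chunked = x.reshape(1024, -1).sum(axis=1, dtype=np.float32).sum(dtype=np.float32)

    # The two results typically differ in the low bits.
    print(f"{full!r} vs {chunked!r} -> bit-identical: {full == chunked}")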


Replies

hatmanstack · yesterday at 7:39 PM

That's why I'd love to get stats on the load, hardware, and location of wherever my inference is running. Looking at you, Trainium.