Hacker News

Jabrov, yesterday at 5:03 PM

Yes, multiple GPUs absolutely help with inference, even for a single model instance. Some models are simply too big to fit on the largest available GPU.

Check out tensor parallelism
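A minimal sketch of the idea, using plain NumPy with list entries standing in for GPUs (real frameworks shard with NCCL collectives, but the math is the same): each device holds a column slice of a layer's weight matrix, computes its partial output, and an all-gather concatenates the slices back into the full result.

```python
import numpy as np

# Tensor parallelism sketch: split a linear layer's weight matrix
# column-wise across "devices" (here just list entries standing in
# for GPUs). Each device multiplies the full input by its shard,
# producing a slice of the output; concatenating the slices (an
# all-gather in a real multi-GPU setup) recovers the full result.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))       # activations: batch x hidden
W = rng.standard_normal((512, 2048))    # full weight: hidden x ffn

n_devices = 4
shards = np.split(W, n_devices, axis=1)         # each device stores 1/4 of W

partials = [x @ w_shard for w_shard in shards]  # runs in parallel, one per GPU
y_parallel = np.concatenate(partials, axis=1)   # all-gather across devices

y_full = x @ W                                  # reference single-GPU result
assert np.allclose(y_parallel, y_full)
```

Since no single device ever holds the full `W`, a model too big for one GPU's memory fits once it's sharded this way. The cost is that every such layer needs a collective over the interconnect, which is what the reply below is getting at.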


Replies

zozbot234, yesterday at 10:10 PM

Tensor parallelism is not useful on consumer platforms with slow interconnects, unless compute is really low and you prioritize lower latency over throughput. Pipeline parallelism (and potentially expert parallelism) is more workable.
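The contrast can be sketched the same way (plain NumPy, lists standing in for GPUs): in pipeline parallelism, whole layers live on different stages, so only the activations cross the interconnect, once per stage boundary per micro-batch, rather than a collective inside every sharded layer.

```python
import numpy as np

# Pipeline parallelism sketch: whole layers are assigned to different
# "devices" (stages). Only activations hop between stages, once per
# stage boundary -- far less interconnect traffic than tensor
# parallelism's per-layer collectives, which is why it tolerates
# slow consumer links (e.g. PCIe without NVLink) better.

rng = np.random.default_rng(1)
layers = [rng.standard_normal((64, 64)) for _ in range(8)]

n_stages = 4                                   # e.g. 4 GPUs
per_stage = len(layers) // n_stages
stages = [layers[i * per_stage:(i + 1) * per_stage] for i in range(n_stages)]

def run_stage(stage, x):
    # A stage runs its layers locally, no cross-device traffic inside.
    for W in stage:
        x = np.tanh(x @ W)
    return x

x = rng.standard_normal((2, 64))               # one micro-batch
h = x
for stage in stages:                           # activations cross here only
    h = run_stage(stage, h)

# Reference: all layers on a single device give the same output.
ref = x
for W in layers:
    ref = np.tanh(ref @ W)
assert np.allclose(h, ref)
```

The trade-off: stages sit idle unless you keep the pipeline full with multiple micro-batches, so you gain throughput rather than per-token latency, consistent with the comment above.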