Had to reach for a new IPC mechanism recently to implement a multi-GPU LLM inference server. My or...

zackangelo • 11/20/2024 • 0 replies • view on HN

Had to reach for a new IPC mechanism recently to implement a multi-GPU LLM inference server.

My original implementation just pinned one GPU to its own thread then used message passing between them in the same process but Nvidia's NCCL library hates this for reasons I haven't fully figured out yet.

I considered gRPC for IPC since I was already using it for the server's API but dismissed it because it was an order of magnitude slower and I didn't want to drag async into the child PIDs.

Serializing the tensors between processes and using the Servo team's ipc-channel crate[0] has worked surprisingly well. If you're using Rust and need a drop-in (ish) replacement for the standard library's channels, give it a shot.

[0] https://github.com/servo/ipc-channel

alt Hacker News