
formerly_proven, yesterday at 10:42 PM

GPU servers have always had crap reliability compared to a normal server (though sticking eight GPUs on a baseboard does complicate things). As I understand it (not my domain), the lack of widespread checkpointing and MPI fault-tolerance (mpift) support is one of the reasons ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought).