nurettin • today at 7:03 AM • 0 replies • view on HN

> we were hit with this on a 256 gpu b200 cluster -- at day 66 all our jobs started randomly failing

ouch
