> we were hit with this on a 256-GPU B200 cluster -- at day 66 all our jobs started randomly failing
ouch