Hacker News

wincy, today at 4:33 AM

Crazy. So if I understand correctly, something involving B200s and NVLink is causing issues: after 66 days and 12 hours of uptime, nvidia-smi and other jobs start failing or timing out, and once you restart the cluster everything works again.

They suspect jobs will still work if you only use one B200, but one person power cycled and so wasn’t able to test that. Hopefully they won’t have to wait another 66 days for further troubleshooting.


Replies

layla5alive, today at 4:42 AM

Maybe some 32-bit counter somewhere, one that's only used in NVLink mode, overflows?
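A quick back-of-the-envelope check on that guess, as a rough sketch only. Nothing in the thread says which counter is involved or what it counts, so the helper names and the candidate tick rates below are assumptions for illustration. The only input taken from the thread is the 66 days 12 hours uptime figure.

```python
# Sanity check for the "32-bit counter overflow" guess.
# Assumption: a counter that starts at zero at boot and increments
# at a fixed rate. The thread does not confirm any of this.

SECONDS_PER_DAY = 86_400

# Uptime at which failures reportedly start: 66 days and 12 hours.
UPTIME_S = 66 * SECONDS_PER_DAY + 12 * 3_600


def implied_tick_rate_hz(bits: int, uptime_seconds: float) -> float:
    """Tick rate at which a `bits`-wide counter wraps after `uptime_seconds`."""
    return 2 ** bits / uptime_seconds


def wrap_time_days(bits: int, tick_rate_hz: float) -> float:
    """Days until a `bits`-wide counter wraps when ticking at `tick_rate_hz`."""
    return 2 ** bits / tick_rate_hz / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"uptime: {UPTIME_S} s")
    # What rate would an unsigned 32-bit or signed 31-bit counter need?
    for bits in (31, 32):
        rate = implied_tick_rate_hz(bits, UPTIME_S)
        print(f"{bits}-bit counter wraps at 66d12h if it ticks at {rate:.2f} Hz")
    # Hypothetical common rates, for comparison only.
    for hz in (100, 250, 1000):
        print(f"32-bit counter at {hz} Hz wraps after {wrap_time_days(32, hz):.1f} days")
```

For an unsigned 32-bit counter to wrap at exactly 66 days 12 hours it would have to tick at about 747.5 Hz, which is not an obviously round rate. A plain millisecond counter would wrap at about 49.7 days instead. So the overflow idea is plausible in spirit, but the counter would need an unusual tick rate, a nonzero starting value, or a different width.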
