Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.
Component                    Type      MTBF (yrs)   AFR
─────────────────────────────────────────────────────────
SSD                          Hardware  ~100         ~1%
RAM uncorrectable error      Hardware  ~75          ~1-4%
NVIDIA A100 critical error†  Hardware  0.18 (65d)   -
NVIDIA H100 critical error†  Hardware  0.15 (50d)   -
† “Critical error” refers to an NVIDIA Xid or sXid error that is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row-remapping errors. A lot seem to be, as another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
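If you want a rough count of how often your own fleet hits these, the driver reports Xid events to the kernel ring buffer as "NVRM: Xid ..." lines, so a quick scrape of dmesg is enough to get started. A minimal sketch (needs permission to read dmesg; the regex is my simplification and the exact log format varies by driver version):

    import re
    import subprocess

    # The NVIDIA driver logs GPU errors to the kernel ring buffer as lines like:
    #   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
    XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)")

    def scan_dmesg_for_xids():
        """Return a list of (pci_bus_id, xid_code) tuples found in dmesg."""
        out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
        return [(m.group(1), int(m.group(2))) for m in XID_PATTERN.finditer(out)]

    if __name__ == "__main__":
        for bus_id, xid in scan_dmesg_for_xids():
            print(f"{bus_id}: Xid {xid}")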
I'm quite surprised the A100 is not much better, since I believe the power levels for the Ampere cards are a lot lower.
Does this mean that even a model which fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
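By "recovery process" I mostly mean periodic checkpointing plus resume-on-restart, something like this minimal sketch (assuming PyTorch; the checkpoint path and the train_one_epoch() callback are placeholders you'd supply yourself):

    import os
    import torch

    CKPT_PATH = "checkpoint.pt"  # placeholder; put it on durable storage

    def save_checkpoint(model, optimizer, epoch):
        # Write atomically: dump to a temp file, then rename over the old
        # checkpoint, so a crash mid-write can't corrupt the only copy.
        tmp = CKPT_PATH + ".tmp"
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, tmp)
        os.replace(tmp, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        # Returns the epoch to resume from (0 if no checkpoint exists).
        if not os.path.exists(CKPT_PATH):
            return 0
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1

    def train(model, optimizer, train_one_epoch, num_epochs):
        start = load_checkpoint(model, optimizer)
        for epoch in range(start, num_epochs):
            train_one_epoch(model, optimizer, epoch)  # your training loop
            save_checkpoint(model, optimizer, epoch)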
If you rebooted every server after 35 days, would that get rid of many of the problems?
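For scale, a quick back-of-the-envelope with the MTBF figures above, assuming exponentially distributed (i.e. memoryless) failures, in which case a scheduled reboot wouldn't change the odds at all:

    import math

    def p_failure_within(days, mtbf_days):
        """P(at least one failure in `days`), assuming exponential inter-failure times."""
        return 1 - math.exp(-days / mtbf_days)

    # Per-GPU critical-error MTBFs from the table above.
    for name, mtbf in [("A100", 65), ("H100", 50)]:
        print(f"{name}: P(critical error within 35 days) ≈ {p_failure_within(35, mtbf):.0%}")
    # Prints roughly 42% for the A100 and 50% for the H100, i.e. most of the
    # expected failures would land before the 35-day reboot anyway.

If the failures are instead driven by something that accumulates with uptime, periodic power cycling could help, which would fit the "tipping over the operational limit" theory above.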
I'm curious if running them at slightly lower voltage would fix it or if it's a software thing.
> operating too close to the operational limit, tipping over it, and then requiring a power cycle.
GPUs--they're just like us!