Hacker News

thundergolfer · yesterday at 8:48 PM · 4 replies

Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -
† “Critical error” refers to an NVIDIA Xid or sXid error that is not recoverable and requires an application and GPU reset.
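
(To relate the two columns: assuming exponentially distributed failures, AFR = 1 - exp(-1/MTBF in years), which is roughly 1/MTBF when MTBF is large. A quick back-of-envelope in Python, plugging in the table values above:)

  import math

  def afr_from_mtbf_years(mtbf_years: float) -> float:
      # Annualized failure rate, assuming exponentially distributed failures.
      return 1.0 - math.exp(-1.0 / mtbf_years)

  print(f"{afr_from_mtbf_years(100):.1%}")    # SSD-like, MTBF ~100 yr: ~1.0% per year
  print(f"{afr_from_mtbf_years(0.15):.1%}")   # H100 critical error, MTBF ~0.15 yr: ~99.9%, i.e. near-certain within a year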

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, as another comment notes, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
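
(For anyone who wants to spot these on their own machines: Xid events land in the kernel log, so a quick scan of dmesg finds them. A rough sketch; the exact message format varies by driver version, so treat the regex as illustrative.)

  import re
  import subprocess

  # Xid events appear in the kernel log as lines like:
  #   NVRM: Xid (PCI:0000:3b:00): 79, pid=..., GPU has fallen off the bus.
  XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

  def scan_xid_errors():
      # Returns (pci_address, xid_code) pairs found in the kernel ring buffer.
      # Reading dmesg may require root on some systems.
      out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
      return [(m.group("pci"), int(m.group("code"))) for m in XID_RE.finditer(out)]

  for pci, code in scan_xid_errors():
      print(f"GPU at {pci}: Xid {code}")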


Replies

salynchnew · today at 4:12 AM

> operating too close to the operational limit, tipping over it, and then requiring a power cycle.

GPUs--they're just like us!

layoric · yesterday at 9:23 PM

I'm quite surprised the A100 is not much better, since I believe the power levels for the Ampere cards are a lot lower.

Does this mean that even a model that fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
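
(For concreteness, by 'recovery process' I mean something as basic as periodic checkpoint-and-resume. A minimal sketch, assuming PyTorch, with a toy model and illustrative names and intervals:)

  import os
  import torch
  import torch.nn as nn

  CKPT = "checkpoint.pt"  # illustrative path

  def save_ckpt(model, opt, step):
      torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, CKPT)

  def load_ckpt(model, opt):
      if not os.path.exists(CKPT):
          return 0
      ckpt = torch.load(CKPT)
      model.load_state_dict(ckpt["model"])
      opt.load_state_dict(ckpt["opt"])
      return ckpt["step"]

  model = nn.Linear(16, 1)                              # toy model standing in for the real one
  opt = torch.optim.SGD(model.parameters(), lr=1e-2)

  start = load_ckpt(model, opt)                         # resume if a previous run left a checkpoint
  for step in range(start, 10_000):
      x = torch.randn(32, 16)
      loss = model(x).pow(2).mean()
      opt.zero_grad(); loss.backward(); opt.step()
      if step % 1_000 == 0:
          save_ckpt(model, opt, step)                   # a crash or GPU reset only loses work since the last save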

shrubble · yesterday at 9:07 PM

If you rebooted every server after 35 days, would that get rid of many of the problems?

jvalencia · yesterday at 9:57 PM

I'm curious if running them at slightly lower voltage would fix it or if it's a software thing.