Hacker News

bluedino · yesterday at 7:54 PM · 10 replies

I help run a fleet of GPU servers, and I might see 1 DIMM or SSD failure for every 50-100 GPU failures.

I realize NVIDIA is just cranking them out as fast as they can, but the quality on them is terrible. They overheat, disappear after you reboot, they fall off the bus, memory failures, and then mix in all the software crashes your users generate...

Our current server vendor is actually good at replacing them, unlike our previous vendor, but the failure rates are just insane. If any other component failed this much we'd have the vendor buy the servers back.


Replies

thundergolfer · yesterday at 8:48 PM

Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -

† “Critical error” refers to an NVIDIA Xid or sXid error which is not recoverable, requiring an application and GPU reset.
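The MTBF and AFR columns are two views of the same rate: under a memoryless (exponential) failure model, AFR = 1 − exp(−1 year / MTBF). A minimal sketch of that conversion; the exponential-model assumption is mine, not stated in the table:

```python
import math

def afr_from_mtbf(mtbf_years: float) -> float:
    """Annualized failure rate under an exponential (memoryless) failure model:
    AFR = 1 - exp(-1 year / MTBF)."""
    return 1.0 - math.exp(-1.0 / mtbf_years)

# SSD: ~100-year MTBF works out to roughly a 1% AFR, consistent with the table.
print(f"SSD  AFR: {afr_from_mtbf(100):.1%}")

# H100 critical error: a 50-day MTBF makes at least one error per year
# near-certain, which is why the AFR column isn't meaningful for the GPUs.
print(f"H100 AFR: {afr_from_mtbf(50 / 365):.1%}")
```

This also shows why the table leaves the GPU AFR cells blank: an annualized rate saturates at ~100% once the MTBF is a small fraction of a year.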

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, like another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
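Those Xid errors surface as NVRM lines in the kernel log, so fleet tooling often just scrapes dmesg for them. A minimal sketch of that scrape; the sample log lines, bus IDs, and messages below are illustrative, not captured from a real machine:

```python
import re

# NVRM Xid messages in the kernel log look roughly like the samples below.
# The regex captures the PCI bus ID and the numeric Xid code.
XID_RE = re.compile(r"NVRM: Xid \((?P<bus>[^)]+)\): (?P<code>\d+)")

SAMPLE_LOG = """\
[12345.678] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
[12350.001] NVRM: Xid (PCI:0000:af:00): 63, pid=5678, Row remapping event.
"""

def extract_xids(log: str) -> list[tuple[str, int]]:
    """Return (bus_id, xid_code) pairs found in a kernel-log excerpt."""
    return [(m.group("bus"), int(m.group("code"))) for m in XID_RE.finditer(log)]

print(extract_xids(SAMPLE_LOG))
```

Mapping each code to a severity (recoverable vs. requiring a reset or RMA) is the part that needs NVIDIA's Xid documentation; the parsing itself is this simple.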

nickysielicki · today at 12:09 AM

Totally matches my experience, and it feels bizarre inside-looking-out that nobody else talks about it. Hardware from 2010-2020 was remarkably stable, and CPUs are still as stable as they were, but we've had this large influx of money spent on these chips that fall over if you look at them funny. I think it leads to a lot of people thinking, "we must be doing something wrong", because it's just outside of their mental model that hardware failures can occur at this rate. But that's just the world we live in.

It's a perfect storm: a lot of companies are doing HPC-style distributed computing for the first time, and lack experience in debugging issues that are unique to it. On top of that, the hardware is moving very fast and they're ill-equipped to update their software and drivers at the rate required to have a good experience. On top of that, the stakes are higher because your cluster is only as strong as its weakest node, which means a single hardware failure can turn the entire multi-million-dollar cluster into a paperweight, which adds more pressure and stress to get it all fixed.

Updating your software means taking that same multi-million-dollar cluster offline for several hours, which is seen as a cost rather than a good investment of time. And a lot of the experts in HPC-style distributed computing will sell you "supported" software, which is basically just paying for the privilege of using outdated software that lacks the bug fixes your cards might desperately need. That model made sense in the 2010s, when Linux (kernel and userspace) was less stable and you genuinely needed to lock your dependencies and let the bugs work themselves out. But it's the exact opposite of what you want to be doing in 2026.

You put all of this together, and it's difficult to be confident whether the hardware is bad, or going bad, or whether it's only manifesting because they're exposed to bugs, or maybe both. Yikes, it's no fun.
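The "weakest node" point above can be made concrete with a toy availability model; the per-node failure probability here is an assumed illustrative number, not a measured rate:

```python
def p_cluster_healthy(n_nodes: int, node_daily_failure_p: float) -> float:
    """Probability a synchronous job sees no node failure in a day,
    assuming independent, identically distributed node failures.
    Illustrative model only; real failures cluster and correlate."""
    return (1.0 - node_daily_failure_p) ** n_nodes

# Even a 0.1% per-node daily failure chance makes interruptions routine at scale.
for n in (8, 128, 1024):
    print(n, round(p_cluster_healthy(n, 0.001), 3))
```

At 1,024 nodes, a 0.1% per-node daily failure probability leaves only about a one-in-three chance of an uninterrupted day, which is why a single flaky GPU dominates the economics of a large synchronous training cluster.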

dlcarrier · yesterday at 8:04 PM

They're also run far closer to the edge of their operational limits than CPUs, so you're far more likely to get one that barely passes manufacturing tests, then degrades just a tiny bit and stops working.

salynchnew · today at 4:10 AM

It's wild that these are the failure rates for datacenter-grade products. If you were pushing consumer GPU servers all-out, I would expect this kind of variation.

I expect it's not just a problem with Nvidia, though.

bigwheels · yesterday at 8:09 PM

FWIW, NVIDIA enterprise hardware does come with good warranty and prompt RMA service.

A deep dive on why these beastly cards fail so much more frequently than all other common current-day hardware would be fascinating!

jldugger · yesterday at 10:48 PM

It's funny: I've been watching all the NVIDIA GTC keynotes from 2012 to now to better understand the ecosystem, and Jensen pretty clearly states a few times that "it's a miracle it works at all". Clearly he's intending to brag about the defect rate on a 50-billion-transistor chip, but maybe he's more right than he realizes.

jayd16 · yesterday at 10:04 PM

For comparison, a GPU has far more memory than a single DIMM, plus plenty of other things going on.

userbinator · today at 3:38 AM

I wonder if GPUs are so dense that SEUs (single-event upsets) are even more common than in CPUs or RAM.

ecesena · yesterday at 11:37 PM

Has anyone tried "turning off some cores" (e.g. using the multi-instance GPU (MIG) feature) to see if/how that increases reliability?

stingrae · today at 3:40 AM

Seems like this would be an issue for building datacenters in space/orbit.
