I wonder why H100 H2D and D2H unpinned memcpy bandwidth is *faster* on PCIe with vendor B than on SXM with vendor D. Is resizable BAR available on PCIe but not SXM?
Or could it be a software configuration difference? The documentation for the driver API flag CU_MEMHOSTREGISTER_IOMEMORY suggests that whether host memory is physically contiguous can matter to the driver, in that case for memory-mapped I/O memory. If vendor B has transparent huge pages (THP) enabled or configured differently than vendor D, allocations up to 2 MiB could end up physically contiguous, which might mean higher efficiency: more bytes transferred per request.
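For concreteness, here's a minimal sketch of what that hypothesis looks like in code. The buffer size and the explicit madvise hint are my own illustration, not anything from the article: on Linux, a 2 MiB-aligned pageable allocation that THP backs with a huge page is physically contiguous, which is the property being speculated about for the driver's handling of unpinned copies.

```cpp
// Sketch only (Linux-specific, hypothetical setup): allocate a pageable,
// 2 MiB-aligned host buffer, ask the kernel to back it with a transparent
// huge page, then do an unpinned H2D copy. Whether THP actually changes the
// driver's staging efficiency is exactly the open question above.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>      // madvise, MADV_HUGEPAGE (Linux)
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 2ull << 20;   // one 2 MiB huge-page-sized buffer

    // Pageable (unpinned) allocation, aligned to the 2 MiB huge-page size.
    void* host = nullptr;
    if (posix_memalign(&host, 2ull << 20, bytes) != 0) return 1;

    // With THP set to "always" or "madvise" in
    // /sys/kernel/mm/transparent_hugepage/enabled, this hint lets the kernel
    // back the range with a single physically contiguous 2 MiB page.
    madvise(host, bytes, MADV_HUGEPAGE);
    std::memset(host, 1, bytes);       // fault the pages in

    void* dev = nullptr;
    cudaMalloc(&dev, bytes);

    // Unpinned copy: the driver stages this through its own pinned buffers;
    // source contiguity is the knob being speculated about here.
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFree(dev);
    free(host);
    std::printf("unpinned H2D copy of %zu bytes done\n", bytes);
    return 0;
}
```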
At a higher level: unpinned memcpy is a performance antipattern. Perhaps vendor D has fewer clients using unpinned memcpy in their workloads than vendor B, or decided for that reason not to spend tuning effort on it. TensorFlow will go to great lengths to copy unpinned memory into a pinned staging buffer if you feed unpinned host-memory tensors to a graph.
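For readers unfamiliar with the pattern, here's a hedged sketch of a pinned staging buffer with the CUDA runtime API. This is the general technique, not TensorFlow's actual implementation; the function name and the per-call allocation are illustrative only.

```cpp
// Sketch of the pinned-staging-buffer pattern for pageable host data.
// Not TensorFlow's code: just the general shape of the technique.
#include <cstring>
#include <cuda_runtime.h>

void h2d_via_pinned_staging(void* dev_dst, const void* pageable_src,
                            size_t bytes, cudaStream_t stream) {
    // In real code the pinned buffer is allocated once and reused (or pooled);
    // allocating it per call, as here, would defeat the purpose.
    void* staging = nullptr;
    cudaMallocHost(&staging, bytes);                    // page-locked host memory

    std::memcpy(staging, pageable_src, bytes);          // CPU copy into pinned memory
    cudaMemcpyAsync(dev_dst, staging, bytes,
                    cudaMemcpyHostToDevice, stream);    // DMA straight from the pinned buffer
    cudaStreamSynchronize(stream);                      // don't free until the DMA is done

    cudaFreeHost(staging);
}
```

Done naively as above it mostly just moves the cost around; the win comes from reusing the staging buffer and overlapping the CPU memcpy with transfers, which is the "great lengths" part.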
A taxonomy of GPU failures, with accompanying statistics, is described in this paper:
Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 GPUs
Are the numbers in the H100 PCIe vs SXM table swapped from row 3 onwards? It looks to me like the PCIe column is showing the higher GiB/s numbers, which is counter to expectations. Or am I misunderstanding those benchmarks?
In his newsletter, Ed Zitron hammered home the point that GPUs depreciate quickly, but this kind of reliability issue is shocking to read. Failures are so common that they sit in a 24/7 Slack channel with customers like Meta (who apparently can't set up a cluster themselves...).
Ed Zitron has also called out the business model of GPU-as-a-service middleman companies like Modal as deeply unsustainable, and I don't see how they can make a profit either if they are only reselling the public clouds. Assuming they are VC-funded, the VCs need returns for their funds.
Unlike the fiber laid during the dot-com boom, the GPUs in use today will eventually end up in the trash. They are treated like toilet paper: use them and throw them away; nothing gets handed down to the next generation.
Who will be the one to mark down these "assets"? And who is providing the money to buy the next batch of GPUs, now that billions have already been spent?
Maybe we'll see a wave of retirements soon.
> It’s underappreciated how unreliable GPUs are. NVIDIA’s hardware is a marvel, the FLOPs are absurd. But the reliability is a drag. A memorable illustration of how AI/ML development is hampered by reliability comes from Meta’s paper detailing the training process for the LLaMA 3 models: “GPU issues are the largest category, accounting for 58.7% of all unexpected issues.”
> Imagine the future we’ll enjoy when GPUs are as reliable as CPUs. The Llama3 team’s CPUs were the problem only 0.5% of the time. In my time at Modal we can’t remember finding a single degraded CPU core.
> For our Enterprise customers we use a shared private Slack channel with tight SLAs. Slack is connected to Pylon, tracking issues from creation to resolution. Because Modal is built on top of the cloud giants and designed for dynamic compute autoscaling, we can replace bad GPUs pretty fast!
> Today, we’re sharing our GPU reliability system as both a demonstration of our commitment to Modal customers and as a guide for fellow travelers renting hyperscaler or neocloud cards. It’s dangerous to go alone! Take this.
> We’ve chosen not to refer to cloud providers directly, but instead give them anonymized A, B, C, D identifiers. If you want to know who’s who, track the clues or buy us a beer sometime.
Come on, either name names or admit it is pure PR.
Edit: or will someone who can decode the clues weigh in?
I help run a fleet of GPU servers, and I might see 1 DIMM or SSD failure for every 50-100 GPU failures.
I realize NVIDIA is just cranking them out as fast as they can, but the quality is terrible. They overheat, disappear after a reboot, fall off the bus, develop memory failures, and then you mix in all the software crashes your users generate...
Our current server vendor is actually good at replacing them, unlike our previous vendor, but the failure rates are just insane. If any other component failed this much we'd have the vendor buy the servers back.