Hacker News

bflesch, yesterday at 7:26 PM, 3 replies

In his newsletter, Ed Zitron hammered home the point that GPUs depreciate quickly, but these kinds of reliability issues are shocking to read. GPUs fail so often that Modal hangs out in a 24/7 Slack channel with customers like Meta (who apparently can't set up a cluster themselves).

Ed Zitron also called out the business model of GPU-as-a-service middleman companies like modal deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds. Assuming they are VC-funded, the VCs need returns on their funds.

Unlike fiber cable during the dot-com boom, the GPUs in use today eventually end up in the trash bin. These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

Who will be the one who marks down these "assets"? Who is providing money to buy the next batch of GPUs, now that billions are already spent?

Maybe we'll see a wave of retirements soon.

> It’s underappreciated how unreliable GPUs are. NVIDIA’s hardware is a marvel, the FLOPs are absurd. But the reliability is a drag. A memorable illustration of how AI/ML development is hampered by reliability comes from Meta’s paper detailing the training process for the LLaMA 3 models: “GPU issues are the largest category, accounting for 58.7% of all unexpected issues.”

> Imagine the future we’ll enjoy when GPUs are as reliable as CPUs. The Llama3 team’s CPUs were the problem only 0.5% of the time. In my time at Modal we can’t remember finding a single degraded CPU core.

> For our Enterprise customers we use a shared private Slack channel with tight SLAs. Slack is connected to Pylon, tracking issues from creation to resolution. Because Modal is built on top of the cloud giants and designed for dynamic compute autoscaling, we can replace bad GPUs pretty fast!


Replies

pixl97, yesterday at 7:44 PM

>These GPUs are treated like toilet paper, you use them and throw them away, nothing you will give to the next generation.

I'm guessing this may be highly dependent on what the bathtub curve looks like, and how much the provider wants to spend on cooling.

Of course with Nvidia being a near monopoly here, they might just not give a fuck and will pump out cards/servers with shitty reliability rates simply because people keep buying them and they don't suffer any economic loss or have to sit in front of a judge.

It would be interesting to see how the error rate per TFLOP (no /s, we're looking at operations, not time) compares to older-generation cards.

charles_irl, yesterday at 9:03 PM

> Ed Zitron also called out the business model of GPU-as-a-service middleman companies like modal deeply unsustainable, and I also don't see how they can make a profit if they are only reselling public clouds.

You got a link for that? I work on Modal and would be interested in seeing the argument!

We think building a proper software layer for multitenant demand aggregation on top of the public clouds is sufficient value-add to be a sustainable business (cf DBRX and Snowflake).

ares623, yesterday at 7:43 PM

I suppose NVIDIA could invest in making their GPUs more reliable? But then that'll make everything else even more expensive lol. If only one of the companies in the chain could take one for the team.
