Hacker News

mnky9800n · yesterday at 5:17 PM

I really don't understand the argument that nvidia GPUs only work for 1-3 years. I am currently using A100s and H100s every day. Those aren't exactly new anymore.


Replies

mbrumlow · yesterday at 6:52 PM

It’s not that they don’t work. It’s how businesses handle hardware.

I worked at a few data centers on and off in my career. I got lots of hardware for free or on the cheap simply because it was considered "EOL" after about 3 years, often when the support contract with the vendor ends.

There are a few things to consider.

Aging hardware produces more errors, and those errors cost you, one way or another.

Rack space is limited. A perfectly fine machine that consumes 2x the power for half the output still costs you. It's cheaper to upgrade a perfectly fine working system simply because the replacement performs better per watt in the same space (a rough sketch of that math follows these points).

Lastly, there are tax implications to buying new hardware that often favor replacement.
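
To make the performance-per-watt point concrete, here is a rough back-of-envelope sketch in Python. Every number in it (rack power budget, per-card wattage, relative throughput) is a made-up assumption for illustration, not a real spec.

    # Back-of-envelope: fixed rack power budget, old vs. new accelerators.
    # All numbers below are illustrative assumptions, not real product specs.
    RACK_POWER_BUDGET_W = 20_000  # assume ~20 kW usable per rack

    old_gpu = {"power_w": 400, "relative_throughput": 1.0}
    new_gpu = {"power_w": 700, "relative_throughput": 3.0}  # assume 3x perf at ~1.75x power

    def rack_throughput(gpu):
        # How many cards fit in the power budget, and what do they deliver together?
        cards = RACK_POWER_BUDGET_W // gpu["power_w"]
        return cards, cards * gpu["relative_throughput"]

    old_cards, old_total = rack_throughput(old_gpu)
    new_cards, new_total = rack_throughput(new_gpu)

    print(f"old: {old_cards} cards -> {old_total:.0f} units of throughput")
    print(f"new: {new_cards} cards -> {new_total:.0f} units of throughput")

Even with fewer, hotter cards per rack, the newer generation delivers more total throughput in the same power envelope, which is why "perfectly fine" hardware still gets swapped out.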

linkregister · yesterday at 5:24 PM

The common factoid raised in financial reports is that GPUs used in model training degrade thermally under sustained high utilization. The GPUs ostensibly fail. I have heard anecdotal reports of GPUs used for cryptocurrency mining showing similar wear patterns.

I have not seen hard data, so this could be an oft-repeated but false claim.

denimnerd42 · yesterday at 5:46 PM

1-3 years is too short, but they aren't making new A100s anymore. There are 8 in a server, and when one goes bad, what do you do? You won't be able to renew a support contract. If you want to DIY, eventually you have to start consolidating pick-and-pulls. Maybe the vendors will buy them back from people who want to upgrade and resell them. This is the issue we are seeing with A100s, and we are trying to see what our vendor will offer for support.

iancmceachern · yesterday at 5:18 PM

They're no longer energy competitive, i.e. the power they draw per unit of compute is higher than what's available now.

It's like running a taxi company when each year's new taxis are more fuel efficient: at some point it costs more to keep fueling the old fleet than it does to replace it.

mbesto · yesterday at 6:11 PM

Not saying you're wrong. A few things to consider:

(1) We simply don't know what the useful life is going to be, because AI-focused GPUs used for training and inference at this scale are so new.

(2) Warranties and service. Most enterprise hardware has service contracts tied to purchases. I haven't seen anything publicly disclosed about what these contracts look like, but the speculation is that they are much more aggressive (3 years or less) than typical enterprise hardware contracts (Dell, HP, etc.). Once hardware ages out of those contracts, extended support typically gets really pricey.

(3) Power efficiency. If new GPUs are more power efficient, the energy savings alone could be huge and help justify upgrades.
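
As a rough illustration of (3), here is a sketch of the energy math; the wattages, electricity price, and fleet size are all assumed numbers, not real figures.

    # Rough sketch of the energy-savings argument. Every number here is an
    # assumption for illustration, not a real spec or price.
    HOURS_PER_YEAR = 24 * 365
    PRICE_PER_KWH = 0.10      # assumed all-in electricity price, USD/kWh
    FLEET_SIZE = 10_000       # assumed number of GPUs

    old_power_w = 400         # assumed draw of the old card for a given workload
    new_power_w = 250         # assumed draw of a newer card doing the same work

    saved_kwh_per_card = (old_power_w - new_power_w) * HOURS_PER_YEAR / 1000
    saved_usd_per_card = saved_kwh_per_card * PRICE_PER_KWH
    saved_usd_fleet = saved_usd_per_card * FLEET_SIZE

    print(f"per card: {saved_kwh_per_card:,.0f} kWh/year -> ${saved_usd_per_card:,.0f}/year")
    print(f"fleet of {FLEET_SIZE:,}: ${saved_usd_fleet:,.0f}/year in electricity alone")

Whether that alone justifies an upgrade depends on the hardware cost, but at fleet scale the electricity line item stops being a rounding error.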

swalsh · yesterday at 8:25 PM

If power is the bottleneck and the newer generation gives you a significant advantage, it may make business sense to rotate to a GPU that makes better use of the same power.

legitster · yesterday at 5:44 PM

From an accounting standpoint, it probably makes sense to depreciate them over 3 years. But yeah, my understanding is that either they have long service lives, or the customers sell them back to the distributor so they can buy the latest and greatest. (The distributor would then sell them as refurbished.)
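
For the accounting side, here is a minimal straight-line depreciation sketch; the purchase price, salvage value, and the 3-year life are made-up assumptions for illustration.

    # Minimal straight-line depreciation schedule for a GPU server.
    # Purchase price, salvage value, and the 3-year life are illustrative.
    purchase_price = 250_000   # assumed cost of an 8-GPU server, USD
    salvage_value = 25_000     # assumed resale/refurb value at end of life
    useful_life_years = 3

    annual_depreciation = (purchase_price - salvage_value) / useful_life_years

    book_value = purchase_price
    for year in range(1, useful_life_years + 1):
        book_value -= annual_depreciation
        print(f"year {year}: expense ${annual_depreciation:,.0f}, book value ${book_value:,.0f}")

After year 3 the asset is fully written down on the books even if the hardware keeps running, or gets sold back and resold as refurbished.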

savorypiano · yesterday at 5:30 PM

You aren't trying to support ad-based demand like OpenAI is.

linuxftw · yesterday at 5:39 PM

I think the story is less about the GPUs themselves, and more about the interconnects for building massive GPU clusters. Nvidia just announced a massive switch for linking GPUs inside a rack. So the next couple of generations of GPU clusters will be capable of things that were previously impossible or impractical.

This doesn't mean much for inference, but for training, it is going to be huge.