
zozbot234 yesterday at 11:48 PM

Reliability also depends strongly on current density and applied voltage, perhaps even more than on thermal density itself. So "slowing down" your average GPU in a long-term, sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale, self-contained tasks (including on a performance-per-watt basis), but their limits are just as obvious, especially given the recently announced advances in clustering the big server GPUs at the rack and perhaps multi-rack level.
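For a rough quantitative sense (a standard reference point, not specific to any GPU): electromigration lifetime is commonly modeled with Black's equation, where current density J enters with an empirical exponent n of roughly 1-2 and temperature T enters exponentially,

    MTTF = A \cdot J^{-n} \cdot \exp\left( \frac{E_a}{k T} \right)

with A a process-dependent constant, E_a the activation energy, and k Boltzmann's constant. Lowering current density and voltage therefore helps lifetime through a term separate from the thermal one, which is one mechanism behind the point above.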


Replies

touisteur today at 7:03 AM

I don't have first-hand knowledge of HBM GPUs, but on the RTX Blackwell 6000 Pro Server, the perf difference between running uncapped at up to 600 W and the same GPU capped at 300 W is less than 10% on any workload I could throw at it (including Tensor Core-heavy ones).
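A minimal sketch of how one could reproduce that kind of capped-vs-uncapped comparison, assuming the nvidia-ml-py (pynvml) bindings and PyTorch; the 600 W / 300 W caps and the fp16 matmul workload are illustrative placeholders, not the exact benchmark from this comment:

    # Sketch: measure throughput of the same GPU under two power caps.
    # Assumes nvidia-ml-py (pynvml) and PyTorch on a CUDA machine; setting
    # the power limit requires root. Caps and workload are illustrative.
    import time
    import pynvml
    import torch

    def set_power_cap(handle, watts):
        lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        mw = int(watts * 1000)  # NVML works in milliwatts
        assert lo <= mw <= hi, f"cap {watts} W outside supported range"
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, mw)  # needs root

    def matmul_tflops(n=8192, iters=50):
        # Tensor Core-friendly fp16 matmul as a stand-in workload.
        a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        b = torch.randn(n, n, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        dt = time.perf_counter() - t0
        return 2 * n ** 3 * iters / dt / 1e12  # 2*n^3 FLOPs per matmul

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    for cap in (600, 300):
        set_power_cap(gpu, cap)
        print(f"{cap} W cap: {matmul_tflops():.1f} TFLOP/s")
    pynvml.nvmlShutdown()

Setting the cap needs root (the equivalent of `sudo nvidia-smi -pl 300`); the measurement itself doesn't.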

That's a very expensive extra 300 W. I wonder what tradeoff made them go for this, and whether capping here is a way to increase reliability. ...

Wonder whether there's any writeup on those additional 300 Watts...
