And NVIDIA supposedly has exactly this know-how for reliability, since their Jetson 'industrial' parts are qualified for 10-15 years at maximum temperature. Of course Jetson sits at a very different point on the flops and watts curve.
Just wondering if reliability improves when you run GPUs a bit more gently. Like pausing more often, and not chasing every pipeline bubble and nvlink-all-reduce optimization.
Jetson uses LPDDR though. H100 failures seem driven by HBM heat sensitivity and the 700W+ power envelope. That's a completely different thermal density, I guess.