Very interesting. From the paper:
H100 shows 3.2× lower per-GPU mean time between errors (MTBE) compared to A100 for uncorrectable ECC memory errors. The per-GB MTBE of the H100’s HBM3 memory is 24% lower (∼8.5M hours) than the A100’s HBM2e memory (∼11.3M hours). We conjecture that the reduction in memory resilience stems from H100’s higher memory capacity.
We attribute the decrease in resilience is primarily due to the higher memory capacity (96 GB vs. 40 GB, a 2.4× increase), which increases the chances of bit flips.
We additionally hypothesize that H100 memory resilience is worse due to (a) a lower signaling voltage that increases susceptibility to bit flips and (b) an increased number of stacks that make heat dissipation challenging and degrade the resilience of memory modules, of the HBM3 memory.
Increasing the voltage just makes the heat dissipation problem worse, so they probably can't just crank that up.
From what I can gather, a typical A100 or H100 is air-cooled. Sounds like liquid cooling them might help, or at least let you bump up those voltages without running into thermal issues.
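As a quick sanity check on the quoted numbers, here is a rough back-of-the-envelope sketch in Python. It assumes (my assumption, not something the paper states in the quote) that errors are spread evenly across the memory, so per-GPU MTBE is roughly per-GB MTBE divided by capacity. The variable and function names are just mine for illustration.

```python
# Figures taken from the quoted passage.
A100_CAPACITY_GB = 40
H100_CAPACITY_GB = 96
A100_MTBE_PER_GB_HOURS = 11.3e6  # HBM2e
H100_MTBE_PER_GB_HOURS = 8.5e6   # HBM3


def per_gpu_mtbe(per_gb_mtbe_hours, capacity_gb):
    """Approximate per-GPU MTBE.

    Assumption: error rate scales linearly with capacity, so the
    per-GPU MTBE is the per-GB MTBE divided by the number of GB.
    """
    return per_gb_mtbe_hours / capacity_gb


a100_mtbe = per_gpu_mtbe(A100_MTBE_PER_GB_HOURS, A100_CAPACITY_GB)
h100_mtbe = per_gpu_mtbe(H100_MTBE_PER_GB_HOURS, H100_CAPACITY_GB)

# How much more memory the H100 has, how much worse each GB is,
# and the combined per-GPU effect.
capacity_ratio = H100_CAPACITY_GB / A100_CAPACITY_GB
per_gb_drop = 1 - H100_MTBE_PER_GB_HOURS / A100_MTBE_PER_GB_HOURS
per_gpu_ratio = a100_mtbe / h100_mtbe

print(f"capacity ratio:        {capacity_ratio:.2f}x")
print(f"per-GB MTBE reduction: {per_gb_drop:.1%}")
print(f"per-GPU MTBE ratio:    {per_gpu_ratio:.2f}x")
```

Under that assumption the per-GPU ratio comes out at about 3.2×, which lines up with the paper's headline figure: most of it is the 2.4× capacity increase, and the rest is the per-GB drop. The per-GB reduction computed from the rounded hour figures is slightly above the quoted 24%, presumably because the paper used unrounded values.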