I haven't looked at all implementations, but the hardware (tensor cores as well as cuda cores) allows you to accumulate at fp16 precision.