
KeplerBoy • 10/11/2024 • 0 replies

I haven't looked at all implementations, but the hardware (Tensor Cores as well as CUDA cores) lets you accumulate in FP16 precision.
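To make concrete what accumulating in FP16 costs, here is a rough pure-Python sketch. It only emulates FP16 rounding on the CPU with the `struct` module's half-precision format; it is not how any GPU kernel is written, and real hardware rounding may differ. The helper names (`to_fp16`, `dot_fp16_accumulate`, `dot_wide_accumulate`) are made up for illustration, and the wide accumulator is a Python float (FP64) standing in for FP32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where each product and each partial sum is rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)  # accumulator never holds more than FP16 precision
    return acc


def dot_wide_accumulate(a, b):
    """FP16 inputs, but a wider accumulator (FP64 here, standing in for FP32)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc


ones = [1.0] * 4096
print(dot_fp16_accumulate(ones, ones))  # 2048.0
print(dot_wide_accumulate(ones, ones))  # 4096.0
```

The FP16 accumulator stalls at 2048 because the spacing between representable FP16 values there is 2, so adding 1.0 rounds back to the same number. That is the usual reason to accumulate in a wider type even when the hardware doesn't force you to.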
