> Meanwhile in just about every other area that uses fp computation it's been the de facto standard for decades.
Not as strongly for heavily parallel code, where the situation is quite similar to the one with atomics in cuDNN. cuBLAS, for example, has a similar reproducibility issue with multi-stream handling, though this can be overcome with proper workspace allocation: https://docs.nvidia.com/cuda/cublas/index.html?highlight=Rep....
That's still better than cuDNN, though, where some operations just don't have a reproducible version. The other fields are at least trying. DL doesn't seem to be.
On that note, Intel added reproducible BLAS to oneMKL on CPU and GPU last year: https://www.intel.com/content/www/us/en/developer/archive/tr...
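
For anyone wondering why atomics and multi-stream execution break reproducibility in the first place: floating-point addition is not associative, so when the accumulation order can change between runs, the result can too. Here is a minimal Python sketch of that effect. It is not cuDNN or cuBLAS code; `ordered_sum` is a hypothetical helper made up for illustration, and the fixed orderings stand in for whatever order the hardware happens to apply the updates in.

```python
import math


def ordered_sum(values, order):
    """Accumulate values in the given index order, as a serial stand-in
    for a parallel reduction whose update order is not fixed."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]

# Same inputs, two accumulation orders.
forward = ordered_sum(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = ordered_sum(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(forward == backward)  # False: the rounding differs with the order

# A correctly rounded sum does not depend on the order of its inputs.
# This is one (costly) route to reproducibility, shown here only as a
# contrast; it is not a claim about how any BLAS library does it.
print(math.fsum(values) == math.fsum(reversed(values)))  # True
```

Fixing the reduction order (which is roughly what a dedicated per-stream workspace buys you) avoids the first problem without paying for exact summation.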