This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.
I do think though that Nvidia generally didn't see much need for more FP64 in consumer GPUs since they wrote in the Ampere (RTX3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."
I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see if the argument of Adjusted Peak Performance for FP64 has merit.
Do you happen to know, though, whether GPUs count as vector processors under these regulations? The weighting factor changes depending on the definition.
https://www.federalregister.gov/documents/2018/10/24/2018-22... What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."
Nvidia GPUs have only 32 threads per warp, short of the 64 elements per vector register that the definition requires, so I suppose they don't count as a vector processor (which seems a bit weird, but who knows)?
> it is in theory possible to emulate FP64 using FP32 operations
I’d say it’s better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision, and quad precision too, using float4. Here’s a paper with the code: https://andrewthall.com/papers/df64_qf128.pdf
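The core trick these double-float libraries build on is an error-free transformation: fl(a+b) plus its exact rounding error, kept as a (hi, lo) pair. Here's a minimal C sketch of Knuth's two-sum and a pair addition in that style; the names (`ff`, `two_sum`, `ff_add`) are mine, not from the paper, and it assumes strict fp32 arithmetic (no x87 excess precision, no -ffast-math):

```c
/* value represented is hi + lo, with |lo| well below ulp(hi) */
typedef struct { float hi, lo; } ff;

/* Knuth's two-sum: s = fl(a+b) and the exact rounding error of that sum. */
static ff two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;                      /* the part of b absorbed into s */
    float e  = (a - (s - bb)) + (b - bb);  /* exact error of the addition  */
    return (ff){ s, e };
}

/* Add two float-float numbers, roughly doubling the effective mantissa. */
static ff ff_add(ff x, ff y) {
    ff s = two_sum(x.hi, y.hi);
    float lo = s.lo + x.lo + y.lo;
    return two_sum(s.hi, lo);              /* renormalize the pair */
}
```

For example, plain fp32 silently drops 1e-8 when added to 1.0, while the (hi, lo) pair keeps it in the low word.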
Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
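To illustrate the integer-only approach, here's a deliberately stripped-down softfloat addition in C: positive, normal, finite doubles only, truncating instead of round-to-nearest, no special-case handling. It's a sketch of the idea (align mantissas by the exponent difference, add, renormalize), not a compliant implementation:

```c
#include <stdint.h>
#include <string.h>

/* Toy fp64 add using only integer ops. Handles positive normal finite
   inputs only, and truncates instead of rounding to nearest. */
static double softadd(double x, double y) {
    uint64_t a, b;
    memcpy(&a, &x, sizeof a);
    memcpy(&b, &y, sizeof b);
    if (a < b) { uint64_t t = a; a = b; b = t; }      /* a holds the larger */
    uint64_t ea = a >> 52, eb = b >> 52;              /* biased exponents (sign bit is 0) */
    uint64_t ma = (a & ((1ULL<<52)-1)) | (1ULL<<52);  /* restore implicit leading 1 */
    uint64_t mb = (b & ((1ULL<<52)-1)) | (1ULL<<52);
    uint64_t shift = ea - eb;
    mb = shift < 64 ? mb >> shift : 0;                /* align smaller mantissa */
    ma += mb;
    if (ma >> 53) { ma >>= 1; ea++; }                 /* carry out: renormalize */
    uint64_t r = (ea << 52) | (ma & ((1ULL<<52)-1));
    double out;
    memcpy(&out, &r, sizeof out);
    return out;
}
```

A real implementation also needs signs, rounding, subnormals, infinities, and NaNs, which is where most of the complexity (and most of the slowdown) lives.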
While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE-compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float-to-double perf ratio either.
And not sure, but I don’t think CUDA SMs are vector processors per se, and not because of the fixed warp size, but more broadly because of the design & instruction set. I could be completely wrong though, and Tensor Cores totally might count as vector processors.
Wikipedia links to this guide to the APP, published in December 2006 (much closer to when the rule itself came out): https://web.archive.org/web/20191007132037/https://www.bis.d.... At the end of the guide is a list of examples.
Only two of these examples meet the definition of a vector processor, and both are very clearly classical vector computers: the Cray X1E and the NEC SX-8 (if you were writing a guide to the historical development of vector processing, you would include these systems or their ancestors as canonical examples of a vector supercomputer). And the definition is pretty clearly tailored so that the SIMD units in existing CPUs wouldn't qualify as vector processors.
The interesting case to point out is the last example, a "Hypothetical coprocessor-based Server," which describes something extremely similar to what GPGPU-based HPC systems turned out to be: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." This hypothetical system, the guide goes on to explain, is not a "vector processor."
From what I can find, it seems that neither Nvidia nor the US government considers GPUs to be vector processors, so they get the 0.3 rather than the 0.9 weighting.