
dahart · today at 5:55 AM

> it is in theory possible to emulate FP64 using FP32 operations

I’d say it’s better than theory: you can definitely use float2 pairs of fp32 values to emulate higher precision, and quad precision too, using float4. Here’s the code: https://andrewthall.com/papers/df64_qf128.pdf
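For anyone curious what the float-pair trick actually looks like, here is a minimal sketch in plain C of the error-free two-sum that df64-style code is built from (my own illustration, not code from the paper; it assumes the compiler really performs fp32 arithmetic, so it breaks under -ffast-math or x87 excess precision):

    #include <stdio.h>

    /* A "double-single" value: hi + lo, two fp32 numbers with |lo| <= ulp(hi)/2. */
    typedef struct { float hi, lo; } df64;

    /* Knuth two-sum: returns s, e such that s + e == a + b exactly. */
    static df64 two_sum(float a, float b) {
        float s = a + b;
        float v = s - a;
        float e = (a - (s - v)) + (b - v);
        return (df64){ s, e };
    }

    /* Add two double-single values, then renormalize the result. */
    static df64 df64_add(df64 a, df64 b) {
        df64 s = two_sum(a.hi, b.hi);
        return two_sum(s.hi, s.lo + a.lo + b.lo);
    }

    int main(void) {
        /* 1 + 2^-30 is not representable in one fp32 (24-bit mantissa),
           but the hi/lo pair carries it without loss. */
        df64 x = { 1.0f, 0.0f };
        df64 y = { 1.0f / (1 << 30), 0.0f };
        df64 z = df64_add(x, y);
        printf("hi=%g lo=%g sum=%.17g\n", z.hi, z.lo, (double)z.hi + (double)z.lo);
        return 0;
    }

Multiplication works the same way, using an fma-based two-prod to capture the rounding error of each product.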

Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
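To give a flavor of the integer route, here is a heavily simplified, hypothetical fp64 multiply in C that touches the bits using only integer operations. It truncates instead of rounding to nearest, ignores zero, subnormals, infinities, NaNs and overflow, and leans on the GCC/Clang __uint128_t extension for the 106-bit mantissa product:

    #include <stdint.h>
    #include <string.h>

    static double fp64_mul_int(double a, double b) {
        uint64_t ia, ib;
        memcpy(&ia, &a, 8);
        memcpy(&ib, &b, 8);

        uint64_t sign = (ia ^ ib) & 0x8000000000000000ULL;
        int64_t  ea = (int64_t)((ia >> 52) & 0x7FF) - 1023;
        int64_t  eb = (int64_t)((ib >> 52) & 0x7FF) - 1023;
        uint64_t ma = (ia & 0x000FFFFFFFFFFFFFULL) | (1ULL << 52); /* implicit 1 */
        uint64_t mb = (ib & 0x000FFFFFFFFFFFFFULL) | (1ULL << 52);

        /* 53x53-bit mantissa product needs up to 106 bits. */
        __uint128_t prod = (__uint128_t)ma * mb;
        int64_t e = ea + eb;
        if (prod >> 105) { prod >>= 53; e += 1; }  /* mantissa product in [2,4) */
        else             { prod >>= 52; }          /* mantissa product in [1,2) */

        uint64_t out = sign
                     | ((uint64_t)(e + 1023) << 52)
                     | ((uint64_t)prod & 0x000FFFFFFFFFFFFFULL);
        double r;
        memcpy(&r, &out, 8);
        return r;
    }

A compliant soft-float also has to get round-to-nearest-even, the special values and the exception flags right, which is where most of the extra instructions (and the slowdown) come from.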

While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float to double perf ratio either.

And I’m not sure, but I don’t think CUDA SMs are vector processors per se; not because of the fixed warp size, but more broadly because of the design and instruction set. I could be completely wrong though, and Tensor Cores might well count as vector processors.


Replies

adrian_b · today at 11:52 AM

What is easy to do is to emulate FP128 with FP64 (double-double) or even FP256 with FP64.
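For concreteness, the double-double building blocks are the same error-free transformations as in the fp32 case, except that the hardware fma makes the product error trivial to capture. A minimal C sketch (illustration only, assuming fma() maps to a hardware instruction):

    #include <math.h>

    /* A double-double value: hi + lo with |lo| <= ulp(hi)/2, ~106 mantissa bits. */
    typedef struct { double hi, lo; } dd;

    /* Error-free product: hi + lo == a * b exactly, thanks to fma. */
    static dd two_prod(double a, double b) {
        double hi = a * b;
        double lo = fma(a, b, -hi);   /* exact rounding error of the product */
        return (dd){ hi, lo };
    }

The catch, as discussed below, is the exponent range, which double-double inherits from FP64 but which a double-single pair does not extend beyond FP32.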

The reason is that the exponent range of FP64 is typically sufficient to avoid overflows and underflows in most applications.

On the other hand, the exponent range of FP32 is insufficient for most scientific-technical computing.

For an adequate exponent range, you must use either three FP32 per FP64, or two FP32 and an integer. In either case the emulation becomes significantly slower than the simplistic double-single emulation.
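As a purely illustrative sketch of the "two FP32 plus an integer" layout (my guess at the idea, not the implementation described here): the float pair carries the mantissa and a separate integer carries the binary exponent, so the fp32 exponent field never limits the range.

    #include <math.h>

    /* value = (hi + lo) * 2^e : hi/lo is a double-single pair kept near 1.0,
       e is the wide exponent that restores FP64-like (or larger) range. */
    typedef struct { float hi, lo; int e; } ext_ds;

    /* Move the binary exponent of hi into e so hi can never overflow or underflow. */
    static ext_ds renorm(ext_ds x) {
        int k;
        float h = frexpf(x.hi, &k);           /* hi == h * 2^k, h in [0.5, 1) */
        return (ext_ds){ h, ldexpf(x.lo, -k), x.e + k };
    }

Every arithmetic operation then has to align and re-extract these exponents, which is presumably where much of the extra cost over plain double-single comes from.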

With the simpler double-single emulation, you cannot expect to just plug it into most engineering applications, e.g. SPICE for electronic circuit simulation, and see that the application works. Some applications could be painstakingly modified to work with such an implementation, but that is not normally an option.

So to be interchangeable with the use of standard FP64 you really must also emulate the exponent range, at the price of much slower emulation.

I did this at some point in the past, but today it makes no sense in comparison with the available alternatives.

Today, the best FP64 performance per dollar by far is achieved with a Ryzen 9950X or Ryzen 9900X in combination with Intel Battlemage B580 GPUs.

When money does not matter, you can use AMD Epyc in combination with AMD "datacenter" GPUs, which would achieve much better performance per watt, but the performance per dollar would be abysmally low.
