logoalt Hacker News

menaerus10/14/20241 replyview on HN

Right, I found this interesting as a thought exercise and took it from another angle.

Since it takes 4 cycles to execute FMA on double-precision 64-bit floats (VFMADD132PD) this translates to 1.25G ops/s (GFLOPS/s) per each core@5GHz. At 192 cores this is 240 GFLOPS/s. For a single FMA unit. At 2x FMA units per core this becomes 480 GFLOPS/s.

For 16-bit operations this becomes 1920 GFLOPS/s or 1.92 TFLOPS/s for FMA workloads.

Similarly, 16-bit FADD workloads are able to sustain more at 2550 GFLOPS/s or 2.55 TFLOPS/s since the FADD is a bit cheaper (3 cycles).

This means that for combined half-precision FADD+FMA workloads zen5 at 192 cores should be able to sustain ~4.5 TFLOPS/s.

Nvidia H100 OTOH per wikipedia entries, if correct, can sustain 50-65 TFLOP/s at single-precision and 750-1000 TFLOPS/s at half-precision. Quite a difference.


Replies

Remnant4410/15/2024

The execution units are fully pipelined, so although the latency is four cycles, you can receive one result every cycle from each of the execution units.

For a Zen 5 core, that means 16 double precision FMAs per cycle using AVX 512, so 80gflop per core at 5ghz, or twice that using fp32

show 1 reply