
Remnant44 • 10/15/2024

The execution units are fully pipelined, so although the latency is four cycles, you can receive one result every cycle from each of the execution units.

For a Zen 5 core, that means 16 double-precision FMAs per cycle using AVX-512, so 80 GFLOPS per core at 5 GHz if you count each FMA as one operation (160 GFLOPS by the usual two-flops-per-FMA convention), or twice that using fp32.
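To make the arithmetic explicit, here is a rough sketch in Python. It assumes two 512-bit FMA units per core, which is what "16 double precision FMAs per cycle" implies (512 / 64 = 8 lanes per unit). The function name `peak_gflops` and its parameters are made up for illustration, and the result is a theoretical peak, not a measurement.

```python
def peak_gflops(fma_units, vector_bits, element_bits, clock_ghz, flops_per_fma=1):
    """Theoretical peak of one core in GFLOPS, assuming every FMA unit
    starts a new full-width FMA on every cycle."""
    lanes = vector_bits // element_bits      # elements per vector register
    fmas_per_cycle = fma_units * lanes       # assumed 2 units * 8 lanes = 16 for fp64
    return fmas_per_cycle * clock_ghz * flops_per_fma


# Assumption: 2 FMA units, 512-bit vectors, 5 GHz, one "flop" per FMA.
fp64 = peak_gflops(fma_units=2, vector_bits=512, element_bits=64, clock_ghz=5.0)
fp32 = peak_gflops(fma_units=2, vector_bits=512, element_bits=32, clock_ghz=5.0)
print(fp64, fp32)  # 80.0 160.0
```

Passing `flops_per_fma=2` gives the figures under the more common convention where a fused multiply-add counts as two floating-point operations.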

Replies

menaerus • 10/15/2024

You're absolutely right, not sure why I dumbed down my example to a single instruction. The correct way to estimate this number is to keep the whole pipeline fed and busy.
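As a rough illustration of what "keeping the pipeline busy" means, the sketch below compares a chain of dependent FMAs with a stream of independent ones, using the four-cycle latency quoted above. The two-unit count is an assumption, the function names are hypothetical, and the model ignores everything except latency and issue rate.

```python
def cycles_dependent(n_ops, latency):
    """Each FMA needs the previous result, so nothing overlaps."""
    return n_ops * latency


def cycles_pipelined(n_ops, latency, units=1):
    """Independent FMAs: every unit accepts one new op per cycle, and the
    last op issued finishes `latency` cycles after it starts."""
    issue_cycles = -(-n_ops // units)        # ceiling division
    return issue_cycles + latency - 1


def ops_in_flight_needed(latency, units):
    """Independent operations required to keep every pipeline stage full."""
    return latency * units


LATENCY = 4   # cycles, from the comment above
UNITS = 2     # assumption: two FMA units per core

print(cycles_dependent(1000, LATENCY))          # 4000
print(cycles_pipelined(1000, LATENCY, UNITS))   # 503
print(ops_in_flight_needed(LATENCY, UNITS))     # 8
```

Under these assumptions, code needs about eight independent FMA chains in flight to approach the peak; a single dependency chain gets roughly one eighth of it.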

This is actually a bit crazy when you stop and think about it. Nowadays, CPUs pack more and more cores per die at somewhat increasing clock frequencies, so they are actually coming quite close to GPUs.

I mean, a top-of-the-line Nvidia H100 can sustain ~30 to ~60 TFLOPS, whereas a Zen 5 part with 192 cores can do only half as much, ~15 to ~30 TFLOPS. This is not even a 10x difference.
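The CPU figures follow from multiplying the per-core peak by the core count. A minimal sketch, assuming all 192 cores sustain the 80 and 160 GFLOPS per-core peaks from the parent comment (an optimistic assumption, since that peak was computed at 5 GHz) and taking the H100 range exactly as quoted; `aggregate_tflops` is a made-up helper name.

```python
def aggregate_tflops(cores, per_core_gflops):
    """Whole-chip theoretical peak in TFLOPS."""
    return cores * per_core_gflops / 1000.0


# Assumption: per-core peaks of 80 (fp64) and 160 (fp32) GFLOPS on every core.
cpu_fp64 = aggregate_tflops(192, 80.0)
cpu_fp32 = aggregate_tflops(192, 160.0)

# GPU range as quoted in the comment, not independently verified here.
gpu_low, gpu_high = 30.0, 60.0

print(round(cpu_fp64, 2), round(cpu_fp32, 2))  # 15.36 30.72
print(round(gpu_low / cpu_fp64, 2), round(gpu_high / cpu_fp32, 2))
```

Both ratios come out close to 2x, which is where "only half as much" comes from.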

