vintagedave • today at 6:14 PM

Author here - thanks - my bad. Fixed 'fast' -> 'fused' :)

I don't have insight into how Prism works, but I have wondered whether the right debugger would show the ARM code and let us see for sure exactly what was going on.

Replies

Const-me • today at 6:25 PM

You’re welcome. Sadly, I don’t know how to observe ARM assembly produced by Prism.

And one more thing.

If you test on an AMD processor, you will probably see a much smaller gain from FMA. Not because FMA is slower there, but because the SSE4 version will run much faster.

On Intel processors like your Tiger Lake, all three operations (addition, multiplication and FMA) compete for the same execution units. On AMD processors, multiplication and FMA also share execution units, but addition is independent: e.g. on Zen4, multiplication and FMA run on execution units FP0 or FP1, while addition runs on FP2 or FP3. As a result, replacing a multiply/add combo with FMA on AMD doesn’t substantially improve throughput in FLOPs. The only win is reduced pressure on the L1i cache and the instruction decoder, since there are fewer instructions.
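
To make the port argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the Zen4 port names come from the paragraph above, but the two shared ports "A" and "B" for the Intel-like case are my assumption (I only claimed the three operations compete for the same units, not how many there are), and `min_cycles` is a hypothetical helper, not a real tool. It ignores latency, dependency chains, loads/stores and the decoder, and just asks how many cycles the FP ports need per multiply-add pair.

```python
from fractions import Fraction
from itertools import combinations

def min_cycles(op_counts, op_ports):
    """Throughput lower bound: each port executes one op per cycle, so any
    group of ports must absorb every op that can run only inside that group."""
    all_ports = sorted(set().union(*(op_ports[k] for k in op_counts)))
    best = Fraction(0)
    for r in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, r):
            s = set(subset)
            # Ops whose allowed ports all lie inside this subset.
            confined = sum(n for k, n in op_counts.items() if op_ports[k] <= s)
            best = max(best, Fraction(confined, r))
    return best

# ASSUMPTION: an Intel-like core where add, mul and fma share two ports.
SHARED = {"add": {"A", "B"}, "mul": {"A", "B"}, "fma": {"A", "B"}}
# Zen4 layout as described above: mul/fma on FP0/FP1, add on FP2/FP3.
ZEN4 = {"add": {"FP2", "FP3"}, "mul": {"FP0", "FP1"}, "fma": {"FP0", "FP1"}}

MUL_ADD = {"mul": 1, "add": 1}   # one a*b followed by one + c
FMA = {"fma": 1}                 # the same work as a single fused op

for name, ports in (("shared ports", SHARED), ("Zen4", ZEN4)):
    print(name, "mul+add:", min_cycles(MUL_ADD, ports),
          "fma:", min_cycles(FMA, ports))
```

In this model the shared-port layout halves the cycles per pair when switching to FMA, while the Zen4 layout gives the same figure either way, which is the effect I was describing.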
