Hacker News

lacedeconstruct yesterday at 8:13 PM (3 replies)

The difference between 20 clock cycles and 1 clock cycle in a hot loop is very noticeable.


Replies

exyi yesterday at 8:49 PM

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

Tuna-Fish yesterday at 8:46 PM

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.
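To make the division-to-multiply point concrete, here is a small sketch of why the rewrite is only bit-exact for some constants. It uses Python floats (IEEE 754 doubles) rather than any particular compiler, and the helper name `same_result` is made up for illustration. The assumption being illustrated: under strict IEEE semantics a compiler can safely turn `x / c` into `x * (1 / c)` when `1 / c` is exactly representable (for example a power of two), while other constants generally need a relaxed floating-point mode, because the two forms can round differently.

```python
def same_result(x: float, divisor: float) -> bool:
    """True if dividing by `divisor` and multiplying by its rounded
    reciprocal produce the same double-precision result for `x`."""
    return x / divisor == x * (1.0 / divisor)


# Hypothetical sample inputs: the integers 1..1000 as floats.
samples = [float(n) for n in range(1, 1001)]

# 1/4 is exactly representable, so both forms compute the same real
# number and round identically.
pow2_mismatches = [x for x in samples if not same_result(x, 4.0)]

# 1/3 is not exactly representable, so the reciprocal is already rounded
# once and the product can land on a neighbouring double.
third_mismatches = [x for x in samples if not same_result(x, 3.0)]

print("mismatches for /4.0:", len(pow2_mismatches))
print("mismatches for /3.0:", len(third_mismatches))
```

For example, `5.0 / 3.0` and `5.0 * (1.0 / 3.0)` differ in the last bit, while dividing by `4.0` never differs from multiplying by `0.25`.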

Sesse__ yesterday at 8:35 PM

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)
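As a rough illustration of the "3/0.5" figure (3 cycles latency, 0.5 cycles reciprocal throughput), the sketch below compares a chain of dependent multiplies with a batch of independent ones. This is a deliberately simplified model, not a simulator: it assumes a single instruction type, perfect scheduling, and no memory or front-end stalls, and the function names are hypothetical.

```python
def dependent_chain_cycles(n_ops: int, latency: float) -> float:
    """Each op needs the previous result, so latencies add up."""
    return n_ops * latency


def independent_cycles(n_ops: int, latency: float, recip_throughput: float) -> float:
    """Ops do not depend on each other: a new one can issue every
    `recip_throughput` cycles, and the last one finishes `latency`
    cycles after it issues."""
    if n_ops == 0:
        return 0.0
    return latency + (n_ops - 1) * recip_throughput


# Numbers quoted in the thread for vmulps: latency 3, reciprocal throughput 0.5.
print(dependent_chain_cycles(1000, 3))      # 3000
print(independent_cycles(1000, 3, 0.5))     # 502.5
```

Under this model the latency only dominates when every multiply waits on the previous one; in a loop over independent elements, throughput is the number that matters.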