The difference between 20 clock cycles and 1 clock cycle in a hot loop is very noticeable.
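FP division by a constant can be optimized by the compiler into a multiply (always when the reciprocal is exactly representable, e.g. a power of two; otherwise only under relaxed floating-point flags such as -ffast-math). Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, no other instructions complete faster.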
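To make the division-to-multiply rewrite concrete, here is a minimal sketch in C. The function names `div_by_8` and `mul_by_eighth` are made up for illustration. With a power-of-two divisor the reciprocal is exact, so both versions give bit-identical results (barring underflow); for other constants the two can differ in the last bit, which is why GCC and Clang only apply the rewrite under flags like -ffast-math or -freciprocal-math.

```c
#include <stddef.h>

/* Straightforward form: one FP division per element. */
void div_by_8(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] / 8.0f;
}

/* Hand-written equivalent: multiply by the reciprocal instead.
   1/8 = 0.125 is exactly representable in binary floating point,
   so this matches the division above for normal-range inputs. */
void mul_by_eighth(float *out, const float *in, size_t n) {
    const float inv = 0.125f;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * inv;
}
```

Whether a given compiler actually emits a multiply for the first function depends on the compiler and optimization level, so checking the generated assembly is the only way to be sure.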
Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 cores are 3/0.5 cycles latency/reciprocal throughput for vmulps. No 20 cycles in sight.)
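A rough sketch of why latency and throughput are quoted separately, again in C with made-up function names (`scale_all`, `product`). In the first loop the multiplies are independent, so the core can overlap them and the loop is bound by multiply throughput. In the second, each multiply needs the previous result, so it is bound by multiply latency.

```c
#include <stddef.h>

/* Independent multiplies: no iteration depends on another, so they
   can be pipelined (and vectorized by the compiler). */
void scale_all(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* Dependent chain: every multiply waits for the previous product,
   so each step costs roughly one full multiply latency. */
float product(const float *x, size_t n) {
    float p = 1.0f;
    for (size_t i = 0; i < n; i++)
        p *= x[i];
    return p;
}
```

Actual timings depend on the microarchitecture and on what the compiler does with each loop, so treat this as an illustration of the dependency structure rather than a benchmark.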
It's 3 cycles for float multiplication (and 1 for shift right):
https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...
https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...
In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.