I believe I first saw this on IACA; uops.info has the measurements for zero-latency inc, add, etc. on Alder Lake: https://uops.info/html-instr/INC_R64.html . Adds by immediate are nicely closed under composition (adding two immediates just gives another immediate), so I've been assuming renamed values are uniformly represented in Golden Cove as register+increment.
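To make the "closed" point concrete, here's a toy model (my own sketch; Intel hasn't documented Golden Cove's internal representation): if the renamer tracks each architectural register as (physical register, immediate offset), an add-by-immediate only bumps the offset and never needs an ALU uop.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of rename-time immediate folding. Purely illustrative;
       not Intel's actual mechanism. */
    typedef struct {
        int     phys;    /* physical register id */
        int64_t offset;  /* accumulated immediate */
    } RenameEntry;

    /* "add reg, imm" handled at rename: fold the immediate into the
       offset. The result is still (phys + offset), so the form is
       closed and no execution uop is needed. */
    static void rename_add_imm(RenameEntry *e, int64_t imm) {
        e->offset += imm;
    }

    int main(void) {
        RenameEntry rax = { .phys = 12, .offset = 0 };
        rename_add_imm(&rax, 3);   /* add rax, 3 */
        rename_add_imm(&rax, 5);   /* add rax, 5 */
        printf("rax -> P%d + %lld\n", rax.phys, (long long)rax.offset);
        return 0;
    }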
> Since the only Alder Lake machine I had access to was a remote Windows machine that didn’t belong to me, I more-or-less had to choose option 3, which meant subjecting myself to The Ultimate Sadness
Well, you can pick up Sapphire Rapids instances from your preferred cloud provider and avoid the sadness.
That's pretty cool.
Normally it would be either the programmer's or the compiler's job to unroll a loop and then reduce the dependency chain lengths.
But it's nice if the renamer can do that as well.
Presumably Intel has real-world data suggesting that significant real workloads can profit from this.
I wonder whether that points to specific software issues, like hypothetically "oh yeah, openjdk8 hotspot was a little too timid at loop unrolling. It won't get that JIT improvement backported, but our customers will use java8 forever. Better fix that in silicon".
Note that not only are multiple consecutive increments reduced to zero latency, but this happens even when they're interleaved with movsxd, as in the second experiment at https://uops.info/html-lat/ADL-P/INC_R64-Measurements.html. It'd be interesting to see what other instructions it can "fuse" with (if that is what is happening).
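For reference, the dependency chain in that second experiment has roughly this shape (my reconstruction, not uops.info's exact harness): inc and movsxd alternate on the same register, so if the chain still runs at movsxd's latency alone, the inc really is contributing zero cycles.

    /* Rough sketch of an inc/movsxd dependency chain (GCC/Clang inline
       asm, AT&T syntax); not uops.info's actual test sequence. */
    static inline void inc_movsxd_chain(unsigned long long *p) {
        unsigned long long x = *p;
        __asm__ volatile(
            "incq   %0\n\t"
            "movslq %k0, %0\n\t"   /* movsxd r64, r32 in Intel syntax */
            "incq   %0\n\t"
            "movslq %k0, %0\n\t"
            : "+r"(x));
        *p = x;
    }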
Deep thoughts: why aren’t “increment” and “excrement” opposites?
You have to use a serializing instruction like cpuid along with rdtsc so that the TSC is not read before the loop terminates (a sketch is below). There have been changes to the Intel docs and there are more options now:
https://stackoverflow.com/a/58146426
Also in the bad old days SMM would interfere on some CPUs.
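For anyone who wants the concrete shape of that: the pattern from Intel's benchmarking whitepaper (and the linked answer) is roughly to serialize with cpuid before the first rdtsc, and to end with rdtscp plus cpuid (or lfence) so the second read can't happen before the measured code finishes. A sketch, assuming GCC/Clang inline asm and ignoring core pinning, frequency scaling, etc.:

    #include <stdint.h>

    static inline uint64_t tsc_begin(void) {
        uint32_t lo, hi;
        __asm__ volatile(
            "xor  %%eax, %%eax\n\t"
            "cpuid\n\t"              /* serialize: earlier insns complete first */
            "rdtsc\n\t"
            : "=a"(lo), "=d"(hi) : : "rbx", "rcx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }

    static inline uint64_t tsc_end(void) {
        uint32_t lo, hi;
        __asm__ volatile(
            "rdtscp\n\t"             /* waits for the measured code to finish */
            "mov  %%eax, %0\n\t"
            "mov  %%edx, %1\n\t"
            "xor  %%eax, %%eax\n\t"
            "cpuid\n\t"              /* keep later insns from moving earlier */
            : "=r"(lo), "=r"(hi) : : "rax", "rbx", "rcx", "rdx", "memory");
        return ((uint64_t)hi << 32) | lo;
    }

    /* usage: t0 = tsc_begin(); ...code under test...; t1 = tsc_end(); */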
Just when you get used to features like x86 CPUs combining two instructions into one micro-op (micro-op fusion), you get something like this.
I guess addition by an immediate is a good choice to execute at the rename/allocation stage, as it's common, relatively simple, and can't generate exceptions.
uops.info's measurements show 'inc r64', interleaved with 'movsxd' instructions, still having zero latency [0], so it can't just be merging the immediates of successive increments (or there's additional fusion happening). A plain unrolled chain of 'inc r64' shows an average latency of 0.2 cycles, i.e. 5 dependent ops per cycle, and 0.2 ports used per instruction [1].
Similarly for 'lea r64, [r64+8]' (imm8), 'lea r64, [r64+128]' (imm32), and 'add r64, 2' (imm8); but not for 'add r64, 0x1000000' (imm32). (A rough way to check the imm8/imm32 split yourself is sketched after the links.)
[0]: https://uops.info/html-lat/ADL-P/INC_R64-Measurements.html
[1]: https://uops.info/html-tp/ADL-P/INC_R64-Measurements.html
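If anyone wants to poke at the imm8/imm32 split directly, something like the following should do it (a rough sketch, not a careful benchmark; TSC ticks aren't core cycles under turbo, and loop overhead isn't subtracted). On Golden Cove the 'add r64, 2' chain should come out well under 1 tick per add, the wide-immediate chain around 1:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    /* Dependent chains of add-by-immediate; the imm8 form is reportedly
       eliminated at rename on Golden Cove, the imm32 form is not. */
    #define ADD_IMM8()  __asm__ volatile("add $2, %0"         : "+r"(x))
    #define ADD_IMM32() __asm__ volatile("add $0x1000000, %0" : "+r"(x))

    static double ticks_per_add(int wide) {
        uint64_t x = 0;
        const int iters = 1000000;
        uint64_t t0 = __rdtsc();
        for (int i = 0; i < iters; i++) {
            if (wide) { ADD_IMM32(); ADD_IMM32(); ADD_IMM32(); ADD_IMM32(); }
            else      { ADD_IMM8();  ADD_IMM8();  ADD_IMM8();  ADD_IMM8();  }
        }
        uint64_t t1 = __rdtsc();
        (void)x;
        return (double)(t1 - t0) / (4.0 * iters);
    }

    int main(void) {
        printf("add r64, imm8 : ~%.2f ticks/add\n", ticks_per_add(0));
        printf("add r64, imm32: ~%.2f ticks/add\n", ticks_per_add(1));
        return 0;
    }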
Thinking about this, it may be a pattern that's designed to match something that expands from a string instruction.
While the loop he's testing is a useless bit of code that does nothing, the optimisation he's discovered may help speed up things like scasb/stosb, allowing portions of two unrolled copies to be processed per clock.
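To illustrate what I read that as (my sketch, not something from the article): a hand-unrolled byte copy where each iteration's addresses depend on the previous iteration's pointer increments. If the renamer absorbs those incs, the loop-carried chain gets much shorter, and two unrolled copies per clock becomes plausible.

    #include <stddef.h>

    /* Illustrative only; assumes n is a positive multiple of 2. */
    static void copy_bytes(unsigned char *dst, const unsigned char *src,
                           size_t n) {
        __asm__ volatile(
            "1:\n\t"
            "movzbl (%[s]), %%eax\n\t"
            "movb   %%al, (%[d])\n\t"
            "incq   %[s]\n\t"          /* pointer increments: candidates */
            "incq   %[d]\n\t"          /* for rename-time elimination    */
            "movzbl (%[s]), %%eax\n\t"
            "movb   %%al, (%[d])\n\t"
            "incq   %[s]\n\t"
            "incq   %[d]\n\t"
            "subq   $2, %[n]\n\t"
            "jnz    1b\n\t"
            : [s] "+r"(src), [d] "+r"(dst), [n] "+r"(n)
            :
            : "rax", "cc", "memory");
    }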