The obvious answer is that XOR is faster. To do a subtraction, you have to propagate the carry from the least-significant bit all the way to the most-significant bit. With XOR you don't have to do that, because each output bit depends only on the corresponding input bits.
There are probably ALU pipeline designs where you don't pay an explicit penalty for that carry chain. But not all, and so XOR is faster.
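You can see the same serial-dependency idea one level up, in multi-word arithmetic (a minimal NASM-syntax sketch, registers chosen arbitrarily): each `sbb` has to wait for the carry flag produced by the instruction before it, just as each bit of a ripple design waits for the carry from the bit below, while XOR has no chain at all.

```asm
; 64-bit subtract on a 32-bit machine: EDX:EAX -= ECX:EBX
sub eax, ebx    ; subtract the low halves; sets CF if a borrow occurred
sbb edx, ecx    ; high halves; cannot finish until CF from the SUB is known
```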
Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
> The answer is so obvious
A tangent, but what is Obvious depends on what you know.
Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.
We should all be kind, and also explain the Obvious things to those who do not know them.
That comment is not very useful without pointing to real-world CPUs where SUB is more expensive than XOR ;)
E.g. on the Z80 and the 6502, both instructions have the same cycle count.
Yea, that’s what immediately went through my head, too. XOR is ALWAYS going to be single cycle because it’s bit-parallel.
I'm not actually aware of any CPUs that perform an XOR faster than a SUB. And more importantly, they have identical timings on the 8086, which is where this pattern comes from.
As TFA says, on x86 `sub eax, eax` encodes to the same number of bytes and executes in the same number of cycles.
From TFA:
> The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely.
From TFA:
> It encodes to the same number of bytes, executes in the same number of cycles.
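For the curious, here's how the idioms encode (NASM syntax, 32-bit mode; bytes per the Intel opcode map, but worth double-checking against your own assembler):

```asm
; three ways to zero EAX, with their machine-code bytes
xor eax, eax    ; 31 C0             -> 2 bytes
sub eax, eax    ; 29 C0             -> 2 bytes
mov eax, 0      ; B8 00 00 00 00    -> 5 bytes
```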
I had a similar reaction when learning 8086 assembly and finding the correct way to do `if x==y` was a CMP instruction which performed a subtraction and set only the flags. (The book had a section with all the branch instructions to use for a variety of comparison operators.) I think I spent a few minutes experimenting with XOR to see if I could fashion a compare-two-values-and-branch macro that avoided any subtraction.
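Something like this, if memory serves (NASM-style 8086 code; the label is just for illustration):

```asm
; the textbook way: CMP performs the subtraction but keeps only the flags
        cmp ax, bx          ; flags from AX - BX, result discarded
        je  values_equal    ; ZF=1 exactly when AX == BX

; the XOR version I was experimenting with: ZF=1 iff AX == BX,
; but unlike CMP it clobbers AX
        xor ax, bx
        jz  values_equal

values_equal:               ; branch target
```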
The operation is slightly more complex, yes, but has there ever been an x86 CPU where SUB or XOR takes more than a single CPU cycle?
I would be surprised if modern CPUs didn't decode "xor eax, eax" into a set of micro-ops that simply moves from an externally invisible dedicated 0 register. These days the x86 ISA is more of an API contract than an actual representation of what the hardware internals do.
It's like 0.5 cycles vs 0.9 cycles of raw gate delay. So both come out to 1 cycle once everything is synchronized to the clock.
XOR is faster when you do that alone in an FPGA or in an ASIC.
When you do XOR together with many other operations in an ALU (arithmetic logic unit), the speed is determined by the slowest operation, so the speed of any faster operation does not matter.
This means that in almost all CPUs XOR and addition and subtraction have the same speed, despite the fact that XOR could be done faster.
In a modern pipelined CPU, the clock frequency is normally chosen so that a 64-bit addition can be done in 1 clock cycle, when including all the overheads caused by registers, multiplexers and other circuitry outside the ALU stages.
Operations more complex than 64-bit addition/subtraction have a latency greater than 1 clock cycle, even if one such operation can be initiated every clock cycle in one of the execution pipelines.
Operations less complex than 64-bit addition/subtraction, like XOR, still execute in 1 clock cycle, so they do not have any speed advantage.
There have existed so-called superpipelined CPUs, where the clock frequency is increased, so that even addition/subtraction has a latency of 2 or more clock cycles.
Only in superpipelined CPUs would it be possible to have an XOR instruction that is faster than subtraction, but I do not know whether this has ever been implemented in a real superpipelined CPU, because it would complicate the execution pipeline for negligible performance improvements.
Initially superpipelining was promoted by DEC as a supposedly better alternative to the superscalar processors promoted by IBM. However, superpipelining was later abandoned, because the superscalar approach provides better energy efficiency for the same performance. (I.e., even if for a few years it was thought that a Speed Demon beats a Brainiac, eventually it was proven that a Brainiac beats a Speed Demon, as the Apple CPUs have shown.)
While mainstream CPUs do not use superpipelining, there have been some relatively recent IBM POWER CPUs that were superpipelined, but for a different reason than originally proposed. Those POWER CPUs were intended for having good performance only in multi-threaded workloads when using SMT, and not in single-thread applications. So by running simultaneous threads on the same ALU the multi-cycle latency of addition/subtraction was masked. This technique allowed IBM a simpler implementation of a CPU intended to run at 5 GHz or more, by degrading only the single-thread performance, without affecting the SMT performance. Because this would not have provided any advantage when using SMT, I assume that in those POWER CPUs XOR was not made faster than subtraction, even if this would have theoretically been possible.
The non-obvious bit is why there isn't an even faster and shorter "mov <register>, 0" instruction - the processors started short-circuiting xor <register>,<register> much later.
> The obvious answer is that XOR is faster.
It used to be not only faster but also smaller. And back then this mattered.
Say you had a computer running at 33 MHz: you had 33 million cycles per second to do your stuff. A 60 Hz game? 33 million / 60, and suddenly you only have about 550,000 cycles per frame. 200 scanlines? Suddenly you're left with only about 2,750 cycles per scanline to do your stuff. And 2,750 cycles really isn't that much.
So every cycle counted back then. We'd use the official doc and see how many cycles each instruction would take. And we'd then verify by code that this was correct too. And memory mattered too.
XOR was both faster and smaller (fewer bytes) than a MOV ..., 0.
Full stop.
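To put numbers on it (8086 encodings and clock counts as I remember them from the Intel manuals; re-check before quoting me):

```asm
mov ax, 0       ; B8 00 00  -> 3 bytes, 4 clocks on the 8086
xor ax, ax      ; 31 C0     -> 2 bytes, 3 clocks on the 8086
```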
And when those CPUs first began having caches, the caches were really tiny: literally caching a ridiculously low number of CPU instructions. We could actually measure the size of the cache manually (for example by filling it with NOP instructions, then modifying them to, say, an "add one", and checking which result we got at the end).
XOR, being smaller, let us fit more instructions in the cache too.
Now people may lament that it persisted long after our x86 CPUs weren't even real x86 CPUs anymore, but that is another topic.
But there's a reason XOR was used and people should deal with it.
We zero with XOR EAX,EAX and that's it.
Because he is explicitly talking about x86 - maybe you missed that.
XOR and SUB have had identical cycle counts and latencies since the 8088. That's because you can "look ahead" when doing carries in binary. It's just a matter of how much floorspace on the chip you want to use.
https://en.wikipedia.org/wiki/Carry-lookahead_adder
The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.
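Concretely (flag behavior per the Intel manuals; I'm ignoring AF, which XOR leaves undefined):

```asm
sub eax, eax    ; ZF=1 SF=0 CF=0 OF=0 - all computed from the result
xor eax, eax    ; ZF=1 SF=0 CF=0 OF=0 - but CF/OF are cleared by definition
```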
His point is that on x86 there is no performance difference, but everyone except his colleague/friend uses xor, while sub actually leaves cleaner flags behind. So he suspects it's some kind of social convention selected at random and then propagated via spurious arguments in support (or that it "looks cooler" as a bit of a term of art).
It could also be a result of most people working in assembly being aware of the properties of logic gates, so they carry an understanding that under the hood XOR might somehow be better.