XOR and SUB have had identical cycle counts and latencies since the 8088. That's because you can "look ahead" when computing carries in binary. It's just a matter of how much die area you want to spend on it.
https://en.wikipedia.org/wiki/Carry-lookahead_adder
The only minor difference between the two on x86, really, is that SUB sets OF and CF according to the result while XOR always clears them.
A carry-lookahead adder makes your circuit depth logarithmic in the width of the inputs, versus linear for a ripple-carry adder, but that is still asymptotically worse than XOR's constant depth.
(But this doesn't change the fact that basically all CPUs execute both in one cycle.)
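To make the depth argument concrete, here is a rough Python sketch, not a description of any real CPU's adder. It compares a ripple-carry add with a Kogge-Stone-style parallel-prefix add, which is one common way to build carry lookahead. The function names, the 8-bit default width, and the "stage count" returned alongside the sum are my own assumptions for illustration.

```python
def ripple_carry_add(a, b, width=8):
    """Add bit by bit; each carry waits for the one below it."""
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    # The carry passes through `width` sequential stages.
    return result, width


def prefix_add(a, b, width=8):
    """Kogge-Stone-style add: carries are combined in doubling spans."""
    mask = (1 << width) - 1
    g = a & b & mask          # generate: this bit produces a carry by itself
    p = (a ^ b) & mask        # propagate: this bit passes an incoming carry on
    half_sum = p              # sum bits before carries are applied
    levels = 0
    shift = 1
    while shift < width:
        # Merge each span with the span `shift` bits below it.
        # g is updated first so that it uses the old value of p.
        g = (g | (p & (g << shift))) & mask
        p = (p & (p << shift)) & mask
        shift <<= 1
        levels += 1
    carries = (g << 1) & mask  # carry into bit i comes from bits 0..i-1
    return (half_sum ^ carries) & mask, levels


if __name__ == "__main__":
    for w in (8, 32, 64):
        _, ripple_stages = ripple_carry_add(1, 1, width=w)
        _, prefix_levels = prefix_add(1, 1, width=w)
        print(w, ripple_stages, prefix_levels)
```

For a 64-bit add the ripple version has 64 carry stages and the prefix version has 6 combine levels, while a bitwise XOR needs a single gate level at any width.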
SUB r,r always clears OF, CF, and AF anyway, so when both operands are the same register that flag difference disappears.
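For anyone who wants to check the flag claims, here is a small Python model of how SUB and XOR set the flags discussed in this thread. It is a sketch on 8-bit values, not an emulator: `sub_flags` and `xor_flags` are made-up helper names, PF and SF are left out, and AF after XOR is returned as `None` because Intel's manual documents it as undefined.

```python
def sub_flags(a, b, width=8):
    """Model of SUB a, b: result plus the CF, OF, AF and ZF flags."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    res = (a - b) & mask
    return {
        "result": res,
        "CF": int(a < b),                             # unsigned borrow
        "OF": int(bool((a ^ b) & (a ^ res) & sign)),  # signed overflow
        "AF": int((a & 0xF) < (b & 0xF)),             # borrow out of the low nibble
        "ZF": int(res == 0),
    }


def xor_flags(a, b, width=8):
    """Model of XOR a, b: CF and OF are always cleared, AF is undefined."""
    mask = (1 << width) - 1
    res = (a ^ b) & mask
    return {"result": res, "CF": 0, "OF": 0, "AF": None, "ZF": int(res == 0)}


if __name__ == "__main__":
    print(sub_flags(0x80, 0x01))  # general case: OF and AF depend on the operands
    print(sub_flags(0x5A, 0x5A))  # same operand twice: CF, OF and AF are all clear
    print(xor_flags(0x5A, 0x5A))
```

With the same value in both operands, both models give a zero result with CF and OF clear and ZF set, which is the zeroing case being discussed.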