It can, because of how CPUs work with registers and hot code paths and all that.
First normalizing everything and then comparing normalized versions isn’t as fast.
It also lets you stop early once a match has been found (or ruled out), so you may not actually have to convert everything.
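A rough sketch of the two strategies. It assumes the "normalization" is plain ASCII lowercasing done by a hypothetical `normalize_char` helper; the thread doesn't say which normalization is meant. Python can't show register or hot-path effects. It can only show the "stopping early" point: the lazy version counts how many characters it actually converted before it returned.

```python
def normalize_char(c: str) -> str:
    """Hypothetical per-character normalization (assumption): ASCII-only lowercasing."""
    if "A" <= c <= "Z":
        return chr(ord(c) + 32)
    return c


def equal_normalize_first(a: str, b: str) -> bool:
    # Build fully normalized copies of both strings, then compare the copies.
    na = "".join(normalize_char(c) for c in a)
    nb = "".join(normalize_char(c) for c in b)
    return na == nb


def equal_lazy(a: str, b: str) -> tuple[bool, int]:
    """Normalize while comparing; return (result, number of characters converted)."""
    if len(a) != len(b):
        # normalize_char maps one char to one char, so different lengths can never match.
        return False, 0
    converted = 0
    for x, y in zip(a, b):
        converted += 2
        if normalize_char(x) != normalize_char(y):
            return False, converted  # stop at the first mismatch
    return True, converted


print(equal_normalize_first("Hello", "hELLO"))  # True
print(equal_lazy("Hello", "hELLO"))             # (True, 10): every character converted
print(equal_lazy("Xello", "hello"))             # (False, 2): stopped after the first pair
```

The lazy version only saves work when the strings differ early. When they are equal, both versions convert every character.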
Running more code per unit of data does not make the code hotter or reduce register pressure; quite the opposite...