On anything but the smallest implementations, the cost difference between a 32-bit and a 64-bit ALU is pretty tiny compared to everything else going on in the core to get performance. And even that assumes the core doesn't support 32-bit ops (leaving the rest of the ALU idle) or do something like double pumping.
Really the ALU width is an internal implementation detail/optimisation: you can tune it to whatever size you want, at the cost of more cycles to complete a full-width operation.
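To make the "more cycles" trade concrete, here is a rough software sketch of a 64-bit add carried out as two 32-bit passes with a carry between them, roughly what a narrow ALU would do internally. The function name `add64_via_32` is made up for illustration, and real hardware would do this in the datapath rather than in code.

```c
#include <stdint.h>

/* Hypothetical helper: add two 64-bit values using only 32-bit additions,
   mimicking a 32-bit ALU that needs two passes for a 64-bit result. */
uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* Pass 1: low halves. The sum wraps modulo 2^32. */
    uint32_t lo = a_lo + b_lo;
    /* The low add overflowed iff the wrapped sum is smaller than an operand. */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;

    /* Pass 2: high halves plus the carry out of pass 1. */
    uint32_t hi = a_hi + b_hi + carry;

    return ((uint64_t)hi << 32) | lo;
}
```

The result matches a native 64-bit add (including wraparound); the only difference is that it takes two dependent steps instead of one.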
It's the MMU width, not the ALU width, that matters.
Lots of machines can run with 32-bit pointers and 64-bit integers ("Knuth mode", aka "ILP32"). You get a huge improvement in memory density for pointer-heavy data, as long as no single process needs more than 4 GB of core.
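A rough way to see the density effect without switching ABIs: compare a node that links with native pointers against one that links with 32-bit values. This is only a stand-in, since the 32-bit links here are pool indices rather than real 32-bit pointers, and the struct and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Tree node linked with native pointers (8 bytes each on an LP64 target). */
struct ptr_node {
    struct ptr_node *left;
    struct ptr_node *right;
    int64_t key;
};

/* Same node with 32-bit links (indices into a node pool), standing in for
   32-bit pointers. The key stays a full 64-bit integer. */
struct compact_node {
    uint32_t left;
    uint32_t right;
    int64_t key;
};

/* Bytes needed to hold n nodes of each layout. */
size_t ptr_pool_bytes(size_t n)     { return n * sizeof(struct ptr_node); }
size_t compact_pool_bytes(size_t n) { return n * sizeof(struct compact_node); }
```

On a typical 64-bit target the pointer version should come out at 24 bytes per node against 16 for the compact one, so a third of the memory goes away while the integer payload is untouched. The exact numbers depend on the ABI and its alignment rules.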