The 386 is a microprogrammed CPU where a multiplication is done by a long sequence of microinstructions, including a loop that is executed a variable number of times, hence its long and variable execution time.
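To illustrate why a looped multiplication has a variable execution time, here is a minimal sketch of a shift-and-add multiply with an early exit. It is not the actual 386 microcode, only an assumed textbook algorithm of the same general kind; the function name and the iteration counter are made up for the illustration.

```python
def shift_add_multiply(a, b, width=32):
    """Unsigned multiply by repeated shift-and-add.

    Returns (product, iterations). Illustrative only: the loop count,
    not real clock cycles, stands in for the variable execution time.
    """
    mask = (1 << width) - 1
    multiplicand = a & mask
    multiplier = b & mask
    product = 0
    iterations = 0
    # Early exit: stop as soon as no set bits remain in the multiplier,
    # so small multipliers finish in fewer loop iterations.
    while multiplier:
        if multiplier & 1:
            product += multiplicand   # one adder (ALU) operation
        multiplicand <<= 1
        multiplier >>= 1
        iterations += 1
    return product, iterations


if __name__ == "__main__":
    print(shift_add_multiply(3, 5))            # few iterations
    print(shift_add_multiply(3, 0x80000000))   # full 32 iterations
```

The number of loop iterations depends on the position of the highest set bit of the multiplier, which is the kind of data dependence that makes the instruction timing variable.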
A register-register operation required 2 microinstructions, presumably for an ALU operation and for writing back into the register file.
The later 80486 had execution pipelines that allowed consecutive ALU operations to be executed back-to-back, so its throughput was 1 ALU operation per clock cycle. In the 80386, by contrast, there was only some pipelining of the overall instruction execution, i.e. instruction fetching and decoding were overlapped with microinstruction execution, but there was no pipelining at a lower level, so it was not possible to execute ALU operations back-to-back. The fastest instructions required 2 clock cycles and most instructions required more.
In the 80386, the ALU itself required the same 1 clock cycle for executing either XOR or SUB, but the minimum time to complete 1 instruction was 2 clock cycles.
Moreover, this time of 2 clock cycles was optimistic: it assumed that the processor had succeeded in fetching and decoding the instruction before the previous instruction was completed. This was not always true, so a XOR or a SUB could randomly require more than 2 clock cycles, when it needed to finish instruction fetching or decoding before doing the ALU operation.
In very old or very cheap processors there are no dedicated multipliers and dividers, so a multiplication or division is done by a sequence of ALU operations. In any high performance processor, multiplications are done by dedicated multipliers and there are also dedicated division/square root devices with their own sequencers. The dividers may share some circuits with the multipliers, or not. When the dividers share some circuits with the multipliers, divisions and multiplications cannot be done concurrently.
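As a sketch of what "a sequence of ALU operations" means for division, here is the classic restoring division algorithm, which uses only shifts, comparisons and subtractions. This is an assumed textbook method chosen for illustration, not a description of any particular processor; the function name is hypothetical.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned division using only shift, compare and subtract.

    Returns (quotient, remainder). One loop iteration per quotient bit,
    each iteration needing a trial subtraction in the ALU.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Trial subtraction: keep the result only if it does not go negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder


if __name__ == "__main__":
    print(restoring_divide(100, 7))
```

A dedicated divider with its own sequencer performs essentially this kind of iteration in hardware, without occupying the ALU for every step.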
In many CPUs, the dedicated multipliers may share some surrounding circuits with an ALU, i.e. they may be connected to the same buses and they may be fed by the same scheduler port, so while a multiplication is executed the associated ALU cannot be used. Nevertheless the core multiplier and ALU remain distinct, because a multiplier and an ALU have very different structures. An ALU is built around an adder by adding a lot of control gates that allow the execution of related arithmetic operations, e.g. subtraction/comparison/increment/decrement, and of bitwise operations. In cheaper CPUs the ALU can also do shifts and rotations, while in higher-performance CPUs there may be a dedicated shifter separate from the ALU.
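The idea of an ALU built around a single adder can be sketched in software. In this rough model (the operation names, the 32-bit width and the function names are assumptions made for the example), subtraction, increment and decrement all reuse the same adder by changing its second input and its carry-in, which is what the control gates do in hardware.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def adder(a, b, carry_in):
    """The single shared adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK, total >> WIDTH


def alu(op, a, b=0):
    """Toy ALU: every arithmetic operation is routed through adder().

    Returns (result, carry_out). For "sub", carry_out == 1 means that
    no borrow occurred (the raw adder carry, not an x86-style flag).
    """
    a &= MASK
    b &= MASK
    if op == "add":
        return adder(a, b, 0)
    if op == "sub":                 # a - b == a + ~b + 1
        return adder(a, ~b & MASK, 1)
    if op == "inc":                 # a + 0 + carry-in of 1
        return adder(a, 0, 1)
    if op == "dec":                 # a + all-ones == a - 1
        return adder(a, MASK, 0)
    if op == "and":
        return a & b, 0
    if op == "or":
        return a | b, 0
    if op == "xor":
        return a ^ b, 0
    raise ValueError("unknown operation: " + op)


if __name__ == "__main__":
    print(alu("sub", 5, 3))
    print(alu("xor", 0b1100, 0b1010))
```

A comparison would be the same as "sub" with the result discarded and only the flags kept. Nothing in this structure resembles the array of partial-product adders in a multiplier, which is why the two remain separate circuits.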
The term ALU can be used in 2 different senses. The strict sense is that an ALU is a digital adder augmented with control gates that allow the selection of any operation from a small set, typically of 8 or 16 or 32 operations, which are simple arithmetic or bitwise operations. Before monolithic processors, computers were made using separate ALU circuits, like TI SN74181+SN74182, or circuits combining an ALU with registers, e.g. AMD 2901/2903.
In the wide sense, ALU may be used to designate an execution unit of a processor, which may include many subunits, which may be ALUs in the strict sense, shifters, multipliers, dividers, shufflers etc.
An ALU in the strict sense is the minimal kind of execution unit required by a processor. Modern high-performance processors have much more complex execution units.