I once wrote that algorithm out line by line, intending each line to map to a single 64-bit ARM instruction. The compiler did idiom detection, transforming it to "builtin popcnt" and (because 64-bit ARMv8.0 lacks a POPCNT instruction) back to the same algorithm, except that the emitted code was one instruction longer than mine.
64-bit ARM actually has a very peculiar encoding of immediates for its logical instructions (AND, ORR, EOR): it supports only repeating bit patterns such as those used by this algorithm. For example, "and x2, x3, #0x3333333333333333" is encoded as one four-byte instruction. (Arithmetic instructions like ADD take a plain 12-bit immediate, optionally shifted, instead.)
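For reference, here is a minimal sketch of what I mean, assuming the algorithm in question is the classic SWAR bit count. The function name `popcount64_swar` is made up for illustration. Each mask is a short bit pattern repeated across all 64 bits, which is the kind of constant the immediate encoding can express.

```c
#include <stdint.h>

/* Hypothetical name; a sketch of the classic SWAR population count. */
static inline unsigned popcount64_swar(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields (mask 0101... repeated). */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum adjacent 2-bit fields into 4-bit fields (mask 0011... repeated). */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum adjacent 4-bit fields into bytes (mask 00001111... repeated). */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Multiply to add all eight byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

Whether each source line really ends up as exactly one instruction depends on the compiler and flags, so treat the one-line-per-instruction mapping as the intent, not a guarantee.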
> because 64-bit ARMv8.0 lacks a POPCNT instruction
It does have this: https://developer.arm.com/documentation/ddi0596/2021-09/SIMD...
And GCC happily uses it: https://godbolt.org/z/dTW46f9Kf
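A minimal sketch of the kind of function you can paste into a compiler explorer to check this yourself, assuming GCC or Clang (this is not necessarily the exact code behind the link, and `popcount64_builtin` is a made-up name). When targeting AArch64, GCC typically moves the value into a vector register, counts bits per byte with the SIMD `cnt`, and sums the bytes with `addv`.

```c
#include <stdint.h>

/* Hypothetical wrapper around the GCC/Clang builtin; the compiler
   chooses the instruction sequence for the target architecture. */
unsigned popcount64_builtin(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
```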