Your compiler will know the best way to popcount, that is the point of that builtin. It'...

dmitrygr • 11/20/2024 • 2 replies • view on HN

Your compiler will know the best way to popcount, that is the point of that builtin. It'll use the best method - sometimes this one. GCC does this, MSVC does this, clang does this, i think even rust has some way to do it (EDIT: it does: count_ones()). On archs which lack POPCNT, it will use this method or another, based on knowing the target. On x86 this approach is OK as is. On arm64, for example, it will be suboptimal due to all the literals needed. On armv6m, this method is bad and table lookups are faster.

Replies

SkiFire13 • 11/20/2024

Note that by default rustc targets x86-64-v1 when compiling for x86-64, and that lacks the popcount instruction. You need to change the target_cpu to at least x86-64-v2 or enable the popcount target_feature. This means that even if your cpu is relatively new and you intend to run your code on relatively new cpus, rustc will still generate older and slower code for count_ones() using bitshifts and masks. That said, I don't see the point in writing them manually if the compiler can generate them for you.

➕ show 1 reply

Findecanor • 11/20/2024

I once wrote that algorithm, divided into single lines, intending each line to be a single 64-bit ARM instruction. The compiler did idiom detection, transforming it to "builtin popcnt" and (because 64-bit ARMv8.0 lacks a POPCNT instruction) back to the same algorithm. Only that the emitted code was one instruction larger than my code.

64-bit ARM's actually has a very peculiar encoding of immediates to arithmetic instructions. It supports only recurring bit patterns such as used by this algorithm. For example "add x2, x3, #3333333333333333" is encoded as one four-byte instruction.

➕ show 1 reply

alt Hacker News

Replies