nth_set_bit_u64: wouldn't that be __builtin_ctzll(_pdep_u64(1<<n, v)) with BMI2?
I believe the equivalent ARM64 instructions are in SVE2 which isn’t yet supported on Apple’s M-series chips as of M4, sadly.
That's assuming you're ok with your program not running on some older cpus.
That's my guess as well.
Bitstring rank/select is a well-known problem, and the BMI and non-BMI (Hacker's Delight) versions are available as a reference.