If by that absolute value thing you mean _mm512_abs_pd, that's a pseudoinstruction for ANDing with a mask that clears the sign bit, which works just as well on 128/256-bit vectors, just without a pre-AVX-512 intrinsic for some arbitrary reason. But yeah, the gaps are super annoying. Some of my personal picks (rough emulation sketches for each follow after the list):
- There's only 8- and 16-bit integer saturating add/subtract, even on AVX-512
- No 8-bit shifts anywhere either; AVX2 only has 32- and 64-bit per-element (variable) shifts (plus ≥16-bit shifts by a constant, though no 64-bit arithmetic right shift!), and AVX-512 adds variable 16-bit shifts but still no 8-bit ones (though with some GFNI magic you can emulate constant 8-bit shifts)
- Narrowing integer types pre-AVX-512 is rather annoying, taking multiple instructions. And even though AVX-512 has dedicated narrowing instructions, you're actually better off using the two-table permute instructions (vpermt2w & co.) and narrowing multiple vectors at the same time.
- Multiplies on x86 are extremely funky (there's a 16-bit high-half instruction, but no other width; a 32×32→64-bit instruction, but no other widening multiply; a proper 32-bit low multiply only arrived with SSE4.1/AVX2, a proper 64-bit one only with AVX-512). ARM NEON doesn't have 64-bit multiplication at all.
- Extracting the top bit of each element (movemask/movmsk) exists for 8-/32-/64-bit elements, but not 16-bit on x86 pre-AVX-512; ARM NEON has none of those, requiring fairly long instruction sequences (and you benefit quite a bit from unrolling and packing multiple vectors together, or even doing structure loads to handle some of the rearranging)
- No 64-bit integer min/max, and no dynamic blend keyed on the top bit of 16-bit elements, pre-AVX-512
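To make some of those concrete (all snippets below are rough sketches, function names are mine): for the abs point at the top, the pre-AVX-512 version is just the same AND spelled out by hand:

```cpp
#include <immintrin.h>

// abs for 4 doubles: clear the sign bit with an AND mask, which is exactly
// what _mm512_abs_pd compiles to (vandpd with a broadcast constant).
static inline __m256d Abs256(__m256d v) {
  const __m256d sign = _mm256_set1_pd(-0.0);  // only the top bit set
  return _mm256_andnot_pd(sign, v);           // v & ~sign
}
```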
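For the saturating add/subtract gap, wider element sizes can be faked; e.g. an unsigned 32-bit saturating add on AVX2:

```cpp
#include <immintrin.h>

// a + min(b, ~a) never wraps, and hits 0xFFFFFFFF exactly when a + b would overflow.
static inline __m256i SatAddU32(__m256i a, __m256i b) {
  const __m256i not_a = _mm256_xor_si256(a, _mm256_set1_epi32(-1));
  return _mm256_add_epi32(a, _mm256_min_epu32(b, not_a));
}
```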
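For constant 8-bit shifts without GFNI, the usual trick is a 16-bit shift plus a mask that drops the bits leaking in from the neighboring byte:

```cpp
#include <immintrin.h>

// Per-byte shift left by a constant N (0..7).
template <int N>
static inline __m256i ShiftLeftU8(__m256i v) {
  const __m256i mask = _mm256_set1_epi8((char)(0xFF << N));  // clear the low N bits of each byte
  return _mm256_and_si256(_mm256_slli_epi16(v, N), mask);
}
```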
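For narrowing pre-AVX-512, the two-for-one route on AVX2 is a saturating pack plus a permute to undo the per-128-bit-lane interleaving:

```cpp
#include <immintrin.h>

// Narrow two vectors of 8 x int32 into one vector of 16 x int16 (signed saturation).
static inline __m256i NarrowI32ToI16(__m256i a, __m256i b) {
  const __m256i packed = _mm256_packs_epi32(a, b);                   // per lane: a0..3,b0..3 | a4..7,b4..7
  return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));  // restore a0..7, b0..7 order
}
```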
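Without vpmullq, the 64-bit low multiply comes down to three 32×32→64 partial products:

```cpp
#include <immintrin.h>

// a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32)
static inline __m256i MulLo64(__m256i a, __m256i b) {
  const __m256i a_hi  = _mm256_srli_epi64(a, 32);
  const __m256i b_hi  = _mm256_srli_epi64(b, 32);
  const __m256i lo_lo = _mm256_mul_epu32(a, b);      // full 64-bit a_lo * b_lo
  const __m256i lo_hi = _mm256_mul_epu32(a, b_hi);
  const __m256i hi_lo = _mm256_mul_epu32(a_hi, b);
  const __m256i cross = _mm256_add_epi64(lo_hi, hi_lo);
  return _mm256_add_epi64(lo_lo, _mm256_slli_epi64(cross, 32));
}
```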
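16-bit movemask pre-AVX-512 is where packing two vectors at once pays off: saturation preserves the sign, so a single pack feeds the 8-bit movemask:

```cpp
#include <immintrin.h>
#include <stdint.h>

// Sign bits of 16 int16 elements (8 from a, then 8 from b) as a 16-bit mask.
static inline uint32_t MoveMask16(__m128i a, __m128i b) {
  return (uint32_t)_mm_movemask_epi8(_mm_packs_epi16(a, b));
}
```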
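And the last two are the usual compare-and-blend / sign-splat dances:

```cpp
#include <immintrin.h>

// 64-bit signed min via compare + byte blend (vpminsq needs AVX-512).
static inline __m256i MinI64(__m256i a, __m256i b) {
  return _mm256_blendv_epi8(a, b, _mm256_cmpgt_epi64(a, b));
}

// Blend on the top bit of 16-bit elements: splat the sign across the element
// so the byte-granular blendv picks consistent halves.
static inline __m256i IfNeg16(__m256i mask, __m256i yes, __m256i no) {
  return _mm256_blendv_epi8(no, yes, _mm256_srai_epi16(mask, 15));
}
```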
I know dzaima is aware, but for all the other posters who might not be: our Highway library provides all of these missing operations, via emulation where required.
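For a flavor of what that looks like (a rough static-dispatch sketch from memory rather than copied from the docs; it assumes the array length is a multiple of the vector length):

```cpp
#include <cstddef>
#include <cstdint>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// 64-bit min over two arrays; Min() should lower to vpminsq where AVX-512
// is available, and to a compare+blend emulation elsewhere.
void MinArrays(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
  const hn::ScalableTag<int64_t> d;
  for (size_t i = 0; i < n; i += hn::Lanes(d)) {
    hn::Store(hn::Min(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
  }
}
```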
I do not understand why folks are still making do with direct use of intrinsics or compiler builtins. Having a library centralize workarounds (such as for an MSAN compiler change which hit us last week) seems like an obvious win.