Hacker News

tomsmeding | last Saturday at 1:04 PM

> I just mean that reasonable portable SIMD abstractions should not be this hard.

Morally, no, it really ought not to be this hard; we need this. Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. x86 and ARM have completely different sets of operations they provide instructions for, and even within the x86 family, even within a particular product class, things are inconsistent:

- On normal words, one has lzcnt (leading-zero count) and tzcnt (trailing-zero count), but on SIMD vectors there is only lzcnt, and that only on AVX-512, the latest-and-greatest in x86 (see the sketch at the end of this comment for the kind of emulation this forces).

- You have horizontal adds (adding adjacent cells in a vector) for 16-bit ints, 32-bit ints, floats and doubles, plus a saturating horizontal add for 16-bit ints (https://www.intel.com/content/www/us/en/docs/intrinsics-guid...). Where are the horizontal adds for 8-bit or 64-bit ints, or any other saturating variants?

- Since AVX-512 filled in a bunch of gaps in the instruction set, you have absolute-value instructions for 8-, 16-, 32- and 64-bit ints in 128-, 256- and 512-bit vectors. But absolute value on floats only exists for 512-bit vectors.

These are just the ones I could find right now; there are more. With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.
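
To make the first gap concrete, here is roughly what emulating a per-lane tzcnt looks like when all you have is a per-lane lzcnt (a sketch in Rust's nightly-only std::simd; the function name is made up):

    #![feature(portable_simd)] // std::simd is nightly-only at the time of writing
    use std::simd::prelude::*;

    // Hypothetical emulation: isolate the lowest set bit (x & -x); for a
    // nonzero lane, tz = 31 - lzcnt(lowest bit). All-zero lanes get 32.
    fn tzcnt_via_lzcnt(x: u32x4) -> u32x4 {
        let lowest = x & (-x.cast::<i32>()).cast::<u32>();
        let tz = u32x4::splat(31) - lowest.leading_zeros();
        x.simd_eq(u32x4::splat(0)).select(u32x4::splat(32), tz)
    }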


Replies

dzaima | last Saturday at 1:51 PM

If by that absolute value thing you mean _mm512_abs_pd, that's a pseudoinstruction for 'and'ing with a mask that zeroes out the top (sign) bit, which can be done equally well on 128/256-bit vectors, just without an intrinsic for some arbitrary reason (the sketch after this list spells out the trick). But yeah, the gaps are super annoying. Some of my personal picks:

- There's only 8- and 16-bit integer saturating add/subtract, even on AVX-512

- No 8-bit shifts anywhere either; AVX2 only has 32- and 64-bit dynamic shifts (and ≥16-bit constant shifts; no 64-bit arithmetic shift right though!), AVX-512 adds dynamic 16-bit shifts, still no 8-bit shifts (though with some GFNI magic you can emulate constant 8-bit shifts)

- Narrowing integer types pre-AVX-512 is rather annoying, taking multiple instructions. And even though AVX-512 has instructions for narrowing vectors, you're actually better off using multiple-table-input permute instructions and narrowing multiple vectors at the same time.

- Multiplies on x86 are extremely funky (there's a 16-bit high half instr, but no other width; a 32×32→64-bit instr, but no other doubling width instr; proper 32-bit multiply is only from AVX2, proper 64-bit only in AVX-512). ARM NEON doesn't have 64-bit multiplication.

- Extracting a single bit from each element (movemask/movmsk) exists for 8-/32-/64-bit elements, but not 16-bit on x86 pre-AVX-512; ARM NEON has none of those, requiring quite long instruction sequences to do the same (you benefit quite a bit from unrolling and packing multiple vectors together, or even from doing structure loads to handle some of the rearranging)

- No 64-bit int min/max nor 16-bit element top-bit dynamic blend pre-AVX512
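
To illustrate the abs_pd point above, this is all the missing 128-bit intrinsic would have to do (a sketch via Rust's std::arch; the helper name is mine, since no such intrinsic exists):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // What _mm512_abs_pd amounts to, written out by hand for a 128-bit
    // vector: clear each lane's sign bit with an AND mask.
    #[cfg(target_arch = "x86_64")]
    unsafe fn abs_pd_128(x: __m128d) -> __m128d {
        let mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fff_ffff_ffff_ffff));
        _mm_and_pd(x, mask)
    }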

exDM69 | last Saturday at 2:44 PM

> Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. x86 and ARM have completely different sets of operations they provide instructions for

Not disagreeing that it's a mess, but there's also quite a big common subset containing all the basic arithmetic ops and some specialized ones like rsqrt, rcp, dot product, etc.

These should be easier to use without having to write the code separately for each instruction set. And they are, with C vector extensions or Rust std::simd.
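
For example, something like this (a minimal sketch; std::simd requires a nightly toolchain) is written once and compiles for any target:

    #![feature(portable_simd)] // std::simd is nightly-only at the time of writing
    use std::simd::prelude::*;

    // One definition; the compiler lowers it to SSE/AVX on x86-64 and
    // NEON on AArch64, with fallback code elsewhere.
    fn axpy(a: f32x8, x: f32x8, y: f32x8) -> f32x8 {
        a * x + y
    }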

Some of the inconsistencies you mention are less of a problem in portable SIMD. Taking Rust as an example:

- lzcnt and tzcnt: std::simd::SimdInt has both leading_zeros and trailing_zeros (and also leading_ones/trailing_ones) for every integer size and vector width (see the sketch after this list).

- horizontal adds: notably missing from std::simd (you have to use intrinsics if you want them), but there is reduce_sum (although it compiles to add and swizzle). Curiously, LLVM does not compile `x + simd_swizzle!(x, [1, 0, 3, 2])` into haddps.

- absolute values for iBxN and fBxN out of the box.
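
A small sketch of all three points (nightly toolchain; the function name is mine):

    #![feature(portable_simd)] // std::simd is nightly-only at the time of writing
    use std::simd::prelude::*;

    fn demo(x: u32x4, f: f32x4) -> (u32x4, u32x4, u32, f32x4) {
        (
            x.leading_zeros(),  // per-lane lzcnt
            x.trailing_zeros(), // per-lane tzcnt, even where the ISA has none
            x.reduce_sum(),     // horizontal reduction to a scalar
            f.abs(),            // lane-wise absolute value at any vector width
        )
    }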

Also, these come with fallback code (which is mostly reasonable, but not always) for when your target CPU doesn't have the instruction. You'll need to enable the features you want at compile time (-C target-feature=+avx2).

> With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.

I agree it negates part of the advantage. But only part, and for that part you have a zero-cost fallback to intrinsics. And in my projects that part has been tiny compared to the overall amount of SIMD code I've written.

For basic arithmetic ops it's a huge win to write the code only once, using normal math operators (+, -, *, /) instead of memorizing the per-CPU intrinsics for two (or more) CPU vendors.
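
To make that concrete, here is the same interpolation written once portably, next to one of the per-ISA intrinsic versions it replaces (a sketch; both function names are mine):

    #![feature(portable_simd)] // std::simd is nightly-only at the time of writing
    use std::simd::prelude::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    // Portable: ordinary operators, one definition for every target.
    fn lerp(a: f32x4, b: f32x4, t: f32x4) -> f32x4 {
        a + (b - a) * t
    }

    // The x86-only equivalent, one of the two or more hand-written
    // versions you'd otherwise maintain.
    #[cfg(target_arch = "x86_64")]
    unsafe fn lerp_sse(a: __m128, b: __m128, t: __m128) -> __m128 {
        _mm_add_ps(a, _mm_mul_ps(_mm_sub_ps(b, a), t))
    }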