Hacker News

exDM69 · last Saturday at 2:44 PM

> Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. X86 and ARM have completely different sets of things that they have instructions for

Not disagreeing that it's a mess, but there's also quite a big common subset containing all the basic arithmetic ops and some specialized ones like rsqrt, rcp, dot product, etc.

These should be easier to use without having to write the code for each instruction set. And they are with C vector extensions or Rust std::simd.

Some of the inconsistencies you mention are less of a problem in portable simd, taking Rust for example:

- lzcnt and tzcnt: std::simd::SimdInt has both leading_zeros and trailing_zeros (also leading/trailing_ones) for every integer size and vector width.

- horizontal adds: notably missing from std::simd (gotta use intrinsics if you want it), but there is reduce_sum (although it compiles to add and swizzle). Curiously, LLVM does not compile `x + simd_swizzle!(x, [1, 0, 3, 2])` into haddps.

- absolute values for iBxN and fBxN out of the box.

Also, these have fallback code (which is mostly reasonable, but not always) when your target CPU doesn't have the instruction. You'll need to enable the features you want at compile time (-C target-feature=+avx2).
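For example, a typical invocation (cargo project assumed) to get 256-bit codegen instead of the scalar/SSE fallback:

```shell
# Enable AVX2 for the whole crate so std::simd lowers to 256-bit ops.
RUSTFLAGS="-C target-feature=+avx2" cargo build --release
```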

> With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.

I agree it negates a part of the advantage. But only a part, and for that part you have zero-cost fallback to intrinsics. And in my projects that part has been tiny compared to the overall amount of SIMD code I've written.

For basic arithmetic ops it's a huge win to only have to write the code once, using normal math operators (+, -, *, /) instead of memorizing the per-CPU intrinsics of two (or more) CPU vendors.