That’s like saying that you have to describe your data flow in terms of gotos because the CPU doesn’t understand for loops and compilers aren’t magic. I don’t mean that autovectorization should just work (tm), I just mean that reasonable portable SIMD abstractions should not be this hard.
There are different ways of approaching it, each with different performance consequences. That is why accelerated libraries are common, but if you want accelerated primitives, you kinda have to roll your own.
> I just mean that reasonable portable SIMD abstractions should not be this hard.
Morally, no, it really ought not to be this hard; we need this. Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. x86 and ARM provide instructions for completely different sets of operations, and even within the x86 family, even within a particular product class, things are inconsistent:
- On normal words, one has lzcnt (leading-zero count) and tzcnt (trailing-zero count), but on SIMD vectors there is only lzcnt. And you get vector lzcnt only with AVX-512, the latest and greatest in x86.
- You have horizontal adds (adding adjacent cells in a vector) for 16-bit ints, 32-bit ints, floats and doubles, and a saturating horizontal add for 16-bit ints. https://www.intel.com/content/www/us/en/docs/intrinsics-guid... Where are the horizontal adds for 8-bit or 64-bit ints, or the saturating horizontal adds for any other width?
- Since AVX-512 filled a bunch of gaps in the instruction set, you have absolute value instructions for 8-, 16-, 32- and 64-bit ints in 128-, 256- and 512-bit vectors. But absolute value on floats only exists for 512-bit vectors.
These are just the ones that I could find now; there are more. With this kind of inconsistency, any portable SIMD abstraction will be difficult to compile efficiently for the majority of CPUs, negating part of the advantage.
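To make two of those gaps concrete, here is a rough sketch of the fallbacks a portable abstraction ends up emitting. It is plain scalar C that models the lanes of one vector, not real intrinsics, and every name in it (`LANES`, `lane_lzcnt`, `lane_tzcnt`, `vec_tzcnt`, `lane_fabs`) is made up for illustration. The trailing-zero count is rebuilt from a leading-zero count, and the float absolute value is done by clearing the sign bit, which is roughly what you would do by hand on a target without a dedicated operation.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4 /* hypothetical: models one 128-bit vector of 32-bit lanes */

/* Portable model of a per-lane leading-zero count.
   Returns 32 for an input of 0, like lzcnt does. */
static inline uint32_t lane_lzcnt(uint32_t x) {
    uint32_t n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Trailing-zero count built only from and-not, subtract and lzcnt.
   ~x & (x - 1) keeps exactly the bits below the lowest set bit, so its
   leading-zero count is 32 minus the number of trailing zeros.
   Returns 32 for an input of 0. */
static inline uint32_t lane_tzcnt(uint32_t x) {
    uint32_t below = ~x & (x - 1u);
    return 32u - lane_lzcnt(below);
}

/* Apply the emulated tzcnt to every lane. */
static inline void vec_tzcnt(const uint32_t in[LANES], uint32_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        out[i] = lane_tzcnt(in[i]);
    }
}

/* Float absolute value without a dedicated operation:
   clear the IEEE 754 sign bit (assumes 32-bit IEEE 754 floats). */
static inline float lane_fabs(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits &= 0x7FFFFFFFu;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The point is not that these fallbacks are hard to write, it's that each one costs several operations where a sibling type or width gets a single instruction, and the abstraction has to know which case it is in for every target.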