This is where the power and expressiveness of kdb+ shine. It has SIMD primitives out of the box and can optimize your code based on data types to take advantage of them. https://kx.com/blog/what-makes-time-series-database-kdb-so-f...
I don't think native C++, even when paired with OpenMP, goes far enough.
In my experience, ISPC and Google's Highway project lead to better results in practice, mostly due to their dynamic dispatching features.
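The dispatch pattern itself is simple; what those libraries add is compiling the same source once per target. A hand-rolled sketch of the underlying idea (sum_avx2/sum_sse2 are hypothetical kernels you'd build in separate translation units with the matching flags, e.g. -mavx2 for the AVX2 one):

    #include <stddef.h>

    /* Hypothetical per-target kernels, defined elsewhere. */
    void sum_avx2(const float *a, size_t n, float *out);
    void sum_sse2(const float *a, size_t n, float *out);

    void sum(const float *a, size_t n, float *out) {
        if (__builtin_cpu_supports("avx2"))  /* GCC/clang builtin, x86 only */
            sum_avx2(a, n, out);
        else
            sum_sse2(a, n, out);
    }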
CUDA got writing SIMD functions right. You never have to say "_m1024ps" in CUDA, you just say float. CUDA has been a huge success but somehow nobody has really copied this paradigm. It's just weird.
> Function calls also have that negative property that the compiler doesn’t know what happens after calling them so it needs to assume the worst happens. And by the worst, it has to assume that the function can change any memory location and optimize for such a case. So it omits many useful compiler optimizations.
This is not the case in C. It might be technically possible for a function to modify any memory, but it wouldn't be legal, and compilers don't need to optimise for the illegal cases. An opaque call only forces the compiler to assume that objects the callee could legally reach (globals, or locals whose address has escaped) may change; everything else can stay in registers.
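A small example of the distinction, assuming a typical GCC/clang compilation model: a global can legally be modified by an opaque call and must be reloaded afterwards, but a local whose address never escapes can live in a register across the call.

    extern void opaque(void);   /* body not visible to the compiler */

    int global_counter;

    int example(void) {
        int local = 1;          /* address never escapes, so the call
                                   below cannot legally modify it; it can
                                   stay in a register across the call */
        int before = global_counter;
        opaque();               /* may legally modify global_counter... */
        return local + before + global_counter;  /* ...so it is reloaded here */
    }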
> Not many compilers support vector functions
Hrm. Anyone who has spent time writing SIMD optimizations knows not to trust the compiler.
If you want to write SIMD then… just use intrinsics? It’s a bit annoying to create a wrapper that supports x64 SSE/AVX and also Arm NEON. But that’s what we’ve been doing for decades.
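That wrapper is mostly boilerplate like the following minimal sketch (just an element-wise add; vf32x4 and vadd_f32x4 are made-up names, and real wrappers grow to hundreds of such functions):

    #if defined(__SSE__) || defined(_M_X64)
      #include <xmmintrin.h>
      typedef __m128 vf32x4;
      static inline vf32x4 vadd_f32x4(vf32x4 a, vf32x4 b) { return _mm_add_ps(a, b); }
    #elif defined(__ARM_NEON)
      #include <arm_neon.h>
      typedef float32x4_t vf32x4;
      static inline vf32x4 vadd_f32x4(vf32x4 a, vf32x4 b) { return vaddq_f32(a, b); }
    #endif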
This right here illustrates why I think there should be better first-class SIMD in languages and why intrinsics are limited.
When using GCC/clang SIMD extensions in C (or Rust nightly), the implementations of sin4f and sin8f are line-by-line identical except for the types. You can work around the duplication with templates/generics.
The sin function is entirely basic arithmetic operations; no fancy instructions are needed (at least for the "computer graphics quality" 32-bit sine function I am using).
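To make that concrete, here's a toy version using the GCC/clang vector extensions; the crude Taylor polynomial stands in for a real "graphics quality" approximation, and the two bodies differ only in the type:

    typedef float f32x4 __attribute__((vector_size(16)));
    typedef float f32x8 __attribute__((vector_size(32)));

    /* Toy sin(x) near 0: x - x^3/6 + x^5/120. Plain arithmetic only;
       GCC/clang splat the scalar constants across the vector. */
    static inline f32x4 sin4f(f32x4 x) {
        f32x4 x2 = x * x;
        return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f)));
    }

    static inline f32x8 sin8f(f32x8 x) {
        f32x8 x2 = x * x;
        return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f)));
    }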
Contrast this with intrinsics, where the programmer needs to explicitly choose between the __m128 and __m256 variants even for trivial stuff like addition and other arithmetic.
Similarly, a 4x4 matrix multiplication function is the exact same code for 64-bit double and 32-bit float if you're using built-in SIMD. With a bit of generics there's no duplication at all, whereas intrinsics again need two separate implementations.
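In C the "generics" boil down to a macro, but the point stands: one body serves both element types. A sketch, again with GCC/clang vector extensions and matrices stored as four row vectors:

    typedef float  f32x4 __attribute__((vector_size(16)));
    typedef double f64x4 __attribute__((vector_size(32)));

    /* 4x4 multiply stamped out for two element types: one body, two names.
       out_row_i = sum_k a[i][k] * b_row_k */
    #define DEFINE_MAT4_MUL(name, vec)                                  \
        static void name(const vec a[4], const vec b[4], vec out[4]) {  \
            for (int i = 0; i < 4; i++) {                               \
                vec acc = a[i][0] * b[0];                               \
                for (int k = 1; k < 4; k++)                             \
                    acc += a[i][k] * b[k];                              \
                out[i] = acc;                                           \
            }                                                           \
        }

    DEFINE_MAT4_MUL(mat4_mul_f32, f32x4)
    DEFINE_MAT4_MUL(mat4_mul_f64, f64x4)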
I understand that there are cases where intrinsics are required, or can deliver better performance, but both C/C++ and Rust have a zero-cost fallback to intrinsics. You can "convert" between f32x4 and __m128 at zero cost (no instructions emitted, just compiler type information).
I do use some intrinsics in my SIMD code this way (rsqrt, rcp, ...). The CPU-specific code is just a few percent of the overall lines of code, and that's for Arm and x86 combined.
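In C terms the fallback looks like this; the casts between the extension type and __m128 are purely type-level, so no instructions are emitted:

    #include <xmmintrin.h>

    typedef float f32x4 __attribute__((vector_size(16)));

    /* The casts compile to nothing: both types mean "four floats in an
       XMM register", only the compiler's type information changes. */
    static inline f32x4 rsqrt4f(f32x4 x) {
        return (f32x4)_mm_rsqrt_ps((__m128)x);
    }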
The killer feature is that my code compiles to both x86_64/SSE and AArch64/NEON. And I can use wider vectors than the CPU actually supports; the compiler knows how to break them down to what the target CPU supports.
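For example, with GCC/clang vector extensions a 16-lane float vector compiles fine for a 4-lane SSE target; the compiler just emits four 128-bit operations per statement:

    /* 64 bytes = 16 floats: wider than SSE (4 lanes) or AVX (8 lanes). */
    typedef float f32x16 __attribute__((vector_size(64)));

    f32x16 scale16(f32x16 v, float s) {
        return v * s;   /* on an SSE-only target this lowers to four mulps */
    }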
I'm hoping that Rust std::simd gets stabilized soon. I've used it for many years and it works great, and when it doesn't, I have a zero-cost fallback to intrinsics.
Some very respected people hold the opinion that std::simd or its C equivalent suffers from a "least common denominator" problem. I don't disagree that the issue exists, but I don't think it matters much when we have a zero-cost fallback available.