This right here illustrates why I think languages need better first-class SIMD, and why intrinsics are limited.
When using the GCC/clang SIMD extensions in C (or std::simd in Rust nightly), the implementations of sin4f and sin8f are line-for-line identical except for the types, and you can eliminate even that duplication with templates/generics.
The sine function is nothing but basic arithmetic operations; no fancy instructions are needed (at least for the "computer graphics quality" 32-bit sine function I am using).
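A minimal sketch of what that looks like with std::simd on nightly Rust; the Taylor coefficients below are illustrative stand-ins, not the actual polynomial from the sine function described above:

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// One generic body; sin::<4> plays the role of sin4f, sin::<8> of sin8f.
// Assumes the input is already range-reduced to roughly [-pi, pi].
pub fn sin<const N: usize>(x: Simd<f32, N>) -> Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    let x2 = x * x;
    // Horner evaluation of x - x^3/6 + x^5/120 - x^7/5040.
    let mut p = Simd::splat(-1.0 / 5040.0);
    p = p * x2 + Simd::splat(1.0 / 120.0);
    p = p * x2 + Simd::splat(-1.0 / 6.0);
    p = p * x2 + Simd::splat(1.0);
    x * p
}
```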
Contrast this with intrinsics, where the programmer has to explicitly choose between the __m128 and __m256 variants even for trivial things like addition and other basic arithmetic.
Similarly, a 4x4 matrix multiplication function is the exact same code for 64-bit double and 32-bit float if you're using built-in SIMD: a bit of generics, and no duplication is needed. Intrinsics, again, require two separate implementations.
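A sketch under the same assumptions (nightly std::simd, matrices stored as four column vectors); the trait bounds are the only ceremony needed to cover both f32 and f64:

```rust
#![feature(portable_simd)]
use std::ops::{Add, Mul};
use std::simd::{Simd, SimdElement};

// 4x4 matrix as four column vectors; the same body serves f32 and f64.
pub fn mat4_mul<T>(a: [Simd<T, 4>; 4], b: [Simd<T, 4>; 4]) -> [Simd<T, 4>; 4]
where
    T: SimdElement,
    Simd<T, 4>: Mul<Output = Simd<T, 4>> + Add<Output = Simd<T, 4>>,
{
    // Column j of the product is a linear combination of a's columns,
    // weighted by the scalar entries of b's column j.
    std::array::from_fn(|j| {
        let bj = b[j].as_array();
        a[0] * Simd::splat(bj[0])
            + a[1] * Simd::splat(bj[1])
            + a[2] * Simd::splat(bj[2])
            + a[3] * Simd::splat(bj[3])
    })
}
```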
I understand that there are cases where intrinsics are required or can deliver better performance, but both C/C++ and Rust have a zero-cost fallback to intrinsics: you can "convert" between f32x4 and __m128 at zero cost (no instructions emitted, just compiler type information).
I do use some intrinsics in my SIMD code this way (rsqrt, rcp, ...). The CPU-specific code is just a few percent of the overall lines of code, and that's for Arm and x86 combined.
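A minimal sketch of that fallback on x86-64: std::simd ships From impls in both directions between Simd<f32, 4> and __m128, so wrapping an SSE intrinsic like rsqrt looks like this:

```rust
#![feature(portable_simd)]
use core::arch::x86_64::{__m128, _mm_rsqrt_ps};
use std::simd::f32x4;

// Approximate reciprocal square root via the SSE intrinsic; the two
// conversions are purely type-level and emit no instructions.
pub fn rsqrt(v: f32x4) -> f32x4 {
    let raw: __m128 = v.into();
    // SSE is baseline on x86-64; the unsafe block is for older toolchains
    // where core::arch intrinsics are still unconditionally unsafe.
    let r = unsafe { _mm_rsqrt_ps(raw) };
    r.into()
}
```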
The killer feature is that my code compiles to both x86-64/SSE and AArch64/NEON. And I can use wider vectors than the CPU actually supports; the compiler knows how to break them down into what the target CPU provides.
I'm hoping Rust's std::simd gets stabilized soon; I've used it for many years and it works great. And when it doesn't, I have a zero-cost fallback to intrinsics.
Some very respected people hold the opinion that std::simd (or its C equivalent) suffers from a "least common denominator problem". I don't dispute that the issue exists, but I don't think it really matters when a zero-cost fallback is available.
> first class SIMD in languages
People have said this for longer than I've been alive. I don't think it's a meaningful concept.
I've been dodging the f32/f64 specificity in Rust using macros, but I don't love it. I do it because I'm not sure what else to do.
I think `core::simd` is probably not near, but AVX-512 support will be out in a month or two! So you could use f64x8, f32x16.
I've built my own f32xY, Vec3xY, etc. structs, since I want to use stable Rust but don't want to wait for `core::simd`. These all mimic the `core::simd` syntax. I set up pack/unpack functions too, but they're still a hassle compared to non-SIMD operations.
This raises the question: if `core::simd` doesn't end up ideal (for flexibility etc.), how far can we get with a thin wrapper lib? The ideal API (imo) is transparent, supports the widest instructions available on your architecture, and falls back to narrower ones or non-SIMD.
It also raises the question of whether you should stop worrying and love the macro... We are talking f32/f64, different widths, different architectures, and now all of that conflated with len-3 vecs, len-4 vecs, etc. (where each vector/tensor element is itself a SIMD value). How does every vector-math library (not "vector" in the SIMD sense; the most popular one uses AoS SIMD, which is IMO a mistake) handle this? Macros. This leads me to think we macro this too.
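For illustration, a minimal macro sketch along those lines (the type names and the plain-array storage are hypothetical; per-arch intrinsics or auto-vectorization would live inside the generated impls):

```rust
// Stamp out mirror-image wrapper types for each element type and width,
// mimicking the core::simd naming, on stable Rust.
macro_rules! make_simd {
    ($name:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug)]
        pub struct $name([$elem; $lanes]);

        impl $name {
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }
            pub fn add(self, o: Self) -> Self {
                let mut out = self.0;
                for (a, b) in out.iter_mut().zip(o.0) {
                    *a += b;
                }
                Self(out)
            }
        }
    };
}

make_simd!(F32x8, f32, 8);
make_simd!(F64x4, f64, 4);
```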
Modern compilers (especially Clang 16+/GCC 13+) have become remarkably good at auto-vectorizing regular scalar code with -O3 -march=native, often matching hand-written SIMD without the maintenance burden.
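For illustration, the kind of plain scalar loop this applies to; rustc's rough analog of those flags is `-C opt-level=3 -C target-cpu=native`:

```rust
// A plain scalar saxpy; with optimizations on, LLVM will normally
// auto-vectorize this loop without any explicit SIMD types.
pub fn saxpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    for (y, &x) in ys.iter_mut().zip(xs) {
        *y += a * x;
    }
}
```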
std::simd in Rust had atrocious compile times last time I tried it; is that fixed already?
My personal gripe with Rust's std::simd in its current form is that it makes writing portable SIMD hard while making non-portable SIMD easy. [0]
> the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics
This is true; I think most SIMD algorithms can be written in such a vector-length-agnostic way. However, almost all code using std::simd specifies a concrete lane count instead of using the native vector length, because the API favors fixed-size types (e.g. f32x4), which are used exclusively in all the documentation and example code.
If I search github for `f32x4 language:Rust` I get 6.4k results, with `"Simd<f32," language:Rust NOT "Simd<f32, 2" NOT "Simd<f32, 4" NOT "Simd<f32, 8"` I get 209.
I'm not even aware of a way to detect the native vector length using std::simd itself. You have to use the target-feature or multiversion crate, as shown in the last part of the rust-simd-book [1]. Well, kind of like that, because their suggestion uses "suggested_vector_width", which doesn't exist; I could only find a suggested_simd_width.
Searching for "suggested_simd_width language:Rust", we are now down to 8 results, 3 of which are from the target-feature/multiversion crates.
---
What I'm trying to say is that while being able to specify a fixed SIMD width can be useful, the encouraged default should be "give me a SIMD vector of the specified element type sized to the SIMD register size". If your problem can only be solved with a specific vector length, great, hard-code the lane count; otherwise, don't.
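A minimal sketch of that width-agnostic default (nightly std::simd; the kernel below is my own illustration of the style, not code from [0]):

```rust
#![feature(portable_simd)]
use std::simd::num::SimdFloat;
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// The kernel never commits to a width; callers instantiate sum::<4>,
// sum::<8>, ... based on what the target actually has.
pub fn sum<const N: usize>(xs: &[f32]) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    let mut acc = Simd::<f32, N>::splat(0.0);
    let mut chunks = xs.chunks_exact(N);
    for c in chunks.by_ref() {
        acc += Simd::from_slice(c);
    }
    // Horizontal reduction, plus the scalar tail that didn't fill a vector.
    acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
}
```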
See [0] for more examples of this.
[0] https://github.com/rust-lang/portable-simd/issues/364#issuec...
[1] https://calebzulawski.github.io/rust-simd-book/4.2-native-ve...