
exDM69 · yesterday at 8:46 AM

This right here illustrates why I think there should be better first-class SIMD in languages, and why intrinsics are limited.

When using GCC/clang SIMD extensions in C (or Rust nightly), the implementations of sin4f and sin8f are identical line by line, with the exception of the types. You can work around this with templates/generics.

The sin function is entirely basic arithmetic operations; no fancy instructions are needed (at least for the "computer graphics quality" 32-bit sine function I am using).
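A minimal sketch of what I mean, on nightly Rust with std::simd (the toy 3-term polynomial here is a stand-in, not my actual sine kernel):

```rust
#![feature(portable_simd)]
use std::simd::{LaneCount, Simd, SupportedLaneCount};

// One body for any lane count: instantiate as poly_sin::<4> for
// "sin4f" and poly_sin::<8> for "sin8f". Only basic arithmetic.
fn poly_sin<const N: usize>(x: Simd<f32, N>) -> Simd<f32, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    // Toy Taylor polynomial x - x^3/6 + x^5/120; a real "graphics
    // quality" sine would add range reduction and tuned coefficients.
    let x2 = x * x;
    x * (Simd::splat(1.0) + x2 * (Simd::splat(-1.0 / 6.0) + x2 * Simd::splat(1.0 / 120.0)))
}
```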

Contrast this with intrinsics, where the programmer needs to explicitly choose between the __m128 and __m256 variants even for trivial stuff like addition and other arithmetic.
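For example, with x86 intrinsics via Rust's std::arch (same pain in C; add4/add8 are just illustrative names):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128, __m256, _mm256_add_ps, _mm_add_ps};

// With raw intrinsics the register width is baked into every call,
// so the 128-bit and 256-bit versions cannot share a body.
#[cfg(target_arch = "x86_64")]
unsafe fn add4(a: __m128, b: __m128) -> __m128 {
    _mm_add_ps(a, b)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add8(a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}
```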

Similarly, a 4x4 matrix multiplication function is the exact same code for 64-bit double and 32-bit float if you're using built-in SIMD. A bit of generics and no duplication is needed, whereas intrinsics again need two separate implementations.
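A sketch of that with std::simd generics (nightly; mat4_mul is a hypothetical name, row-major layout assumed):

```rust
#![feature(portable_simd)]
use core::ops::{Add, Mul};
use std::simd::{Simd, SimdElement};

// Identical body for f32 and f64: rows of `b` are loaded as SIMD
// vectors, entries of `a` are broadcast with splat.
fn mat4_mul<T>(a: &[[T; 4]; 4], b: &[[T; 4]; 4]) -> [[T; 4]; 4]
where
    T: SimdElement,
    Simd<T, 4>: Add<Output = Simd<T, 4>> + Mul<Output = Simd<T, 4>>,
{
    let b_rows = b.map(|r| Simd::from_array(r));
    a.map(|row| {
        let mut acc = Simd::splat(row[0]) * b_rows[0];
        acc = acc + Simd::splat(row[1]) * b_rows[1];
        acc = acc + Simd::splat(row[2]) * b_rows[2];
        acc = acc + Simd::splat(row[3]) * b_rows[3];
        acc.to_array()
    })
}
```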

I understand that there are cases where intrinsics are required, or can deliver better performance, but both C/C++ and Rust have a zero-cost fallback to intrinsics. You can "convert" between f32x4 and __m128 at zero cost (no instructions emitted, just compiler type information).

I do use some intrinsics in my SIMD code this way (rsqrt, rcp, ...). The CPU-specific code is just a few percent of the overall lines of code, and that's for Arm and x86 combined.
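The round-trip looks like this on the x86 path (rsqrt_fast is an illustrative name; the From conversions between std::simd and std::arch types emit no instructions):

```rust
#![feature(portable_simd)]
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128, _mm_rsqrt_ps};
use std::simd::f32x4;

// Drop to the vendor type to reach an instruction std::simd doesn't
// expose, then come back; both conversions are free.
#[cfg(target_arch = "x86_64")]
fn rsqrt_fast(v: f32x4) -> f32x4 {
    let raw: __m128 = v.into();
    // SAFETY: SSE is part of the x86_64 baseline.
    f32x4::from(unsafe { _mm_rsqrt_ps(raw) })
}
```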

The killer feature is that my code compiles to both x86_64/SSE and AArch64/NEON. And I can use wider vectors than the CPU actually supports; the compiler knows how to break them down to what the target CPU supports.
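For instance, this compiles fine even on a plain 128-bit SSE target:

```rust
#![feature(portable_simd)]
use std::simd::f32x16;

// 512 bits of f32 on a 128-bit target: the compiler lowers this one
// operation into four native-width multiplies.
fn scale(v: f32x16, s: f32) -> f32x16 {
    v * f32x16::splat(s)
}
```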

I'm hoping that Rust's std::simd gets stabilized soon; I've used it for many years and it works great. And when it doesn't, I have a zero-cost fallback to intrinsics.

Some very respected people hold the opinion that std::simd (or its C equivalent) suffers from a "least common denominator problem". I don't disagree that the issue exists, but I don't think it really matters when we have a zero-cost fallback available.


Replies

camel-cdr · yesterday at 10:16 AM

My personal gripe with Rust's std::simd in its current form is that it makes writing portable SIMD hard while making non-portable SIMD easy. [0]

> the implementation of sin4f and sin8f are line by line equal, with the exception of types. You can work around this with templates/generics

This is true; I think most SIMD algorithms can be written in such a vector-length-agnostic way. However, almost all code using std::simd specifies a specific lane count instead of using the native vector length. This is because the API favors the use of fixed-size types (e.g. f32x4), which are used exclusively in all documentation and example code.

If I search github for `f32x4 language:Rust` I get 6.4k results, with `"Simd<f32," language:Rust NOT "Simd<f32, 2" NOT "Simd<f32, 4" NOT "Simd<f32, 8"` I get 209.

I'm not even aware of a way to detect the native vector length using std::simd. You have to use the target-feature or multiversion crate, as shown in the last part of the rust-simd-book [1]. Well, kind of: their suggestion uses "suggested_vector_width", which doesn't exist; I could only find a suggested_simd_width.

Searching for "suggested_simd_width language:Rust", we are now down to 8 results, 3 of which are from the target-feature/multiversion crates.

---

What I'm trying to say is that, while being able to specify a fixed SIMD width can be useful, the encouraged default should be "give me a SIMD vector of the specified type corresponding to the SIMD register size". If your problem can only be solved with a specific vector length, great, then hard-code the lane count; otherwise, don't.
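The shape the rust-simd-book is getting at looks roughly like this (a sketch only: I'm assuming multiversion's `selected_target!` and `suggested_simd_width` behave as the book describes, and haven't verified the exact current API):

```rust
#![feature(portable_simd)]
use multiversion::{multiversion, target::selected_target};
use std::simd::{num::SimdFloat, Simd};

// Each clone the macro compiles picks its lane count from the target
// it was compiled for, instead of hard-coding f32x4 or f32x8.
#[multiversion(targets = "simd")]
fn sum(values: &[f32]) -> f32 {
    const N: usize = match selected_target!().suggested_simd_width::<f32>() {
        Some(w) => w,
        None => 1, // no SIMD available: scalar fallback
    };
    let (chunks, tail) = values.as_chunks::<N>();
    let mut acc = Simd::<f32, N>::splat(0.0);
    for chunk in chunks {
        acc += Simd::from_array(*chunk);
    }
    acc.reduce_sum() + tail.iter().sum::<f32>()
}
```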

See [0] for more examples of this.

[0] https://github.com/rust-lang/portable-simd/issues/364#issuec...

[1] https://calebzulawski.github.io/rust-simd-book/4.2-native-ve...

MangoToupe · yesterday at 11:04 AM

> first class SIMD in languages

People have said this for longer than I've been alive. I don't think it's a meaningful concept.

the__alchemist · yesterday at 12:55 PM

I've been dodging the f32/f64-specificity in Rust using macros, but I don't love it. I do it because I'm not sure what else to do.

I think `core::simd` is probably not near, but 512-bit AVX-512 SIMD will be out on stable in a month or two! So, you could use f64x8, f32x16.

I've built my own f32xY, Vec3xY, etc. structs, since I want to use stable Rust but don't want to wait for `core::simd`. These all have APIs that mimic `core::simd`. I set up pack/unpack functions too, but they are still a hassle compared to non-SIMD operations.

This raises the question: if `core::simd` doesn't end up ideal (for flexibility etc.), how far can we get with a thin-wrapper lib? The ideal API (IMO) is transparent, supports the widest instructions available on your architecture, and falls back to smaller ones or non-SIMD.

It also raises the question of whether you should just stop worrying and love the macro... We are talking f32/f64, different widths, different architectures, and now all that conflated with len-3 vecs, len-4 vecs, etc. (where each vector/tensor component is a SIMD value). How does every vector math library handle this (vectors not in the SIMD sense; the most popular one uses AoS SIMD, which is IMO a mistake)? Macros. This leads me to think we should macro this too.
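A minimal sketch of that macro direction on stable Rust (define_simd!, F32x8, F64x4 are all hypothetical names):

```rust
// Stamp out the same wrapper body for each (element type, lane count)
// pair; the per-lane loops are simple enough that the optimizer
// typically vectorizes them.
macro_rules! define_simd {
    ($name:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug)]
        pub struct $name(pub [$elem; $lanes]);

        impl core::ops::Add for $name {
            type Output = Self;
            fn add(self, rhs: Self) -> Self {
                let mut out = self.0;
                for i in 0..$lanes {
                    out[i] += rhs.0[i];
                }
                Self(out)
            }
        }
    };
}

define_simd!(F32x8, f32, 8);
define_simd!(F64x4, f64, 4);
```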

ethan_smith · yesterday at 3:12 PM

Modern compilers (especially Clang 16+/GCC 13+) have become remarkably good at auto-vectorizing regular scalar code with -O3 -march=native, often matching hand-written SIMD without the maintenance burden.
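For example, a plain scalar loop like this (Rust here, same story in C) usually needs no intrinsics at all:

```rust
// Built with -C opt-level=3 -C target-cpu=native (the Rust equivalent
// of -O3 -march=native), compilers typically turn this loop into wide
// SIMD loads and fused multiply-adds.
pub fn saxpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    for (y, x) in ys.iter_mut().zip(xs) {
        *y += a * x;
    }
}
```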

ozgrakkurt · yesterday at 12:07 PM

std::simd in Rust had atrocious compile times last time I tried it; is that fixed already?