Hacker News

exDM69 · last Saturday at 10:46 AM

The binary portability issue is not specific to Rust or std::simd. You would have to solve the same problems even if you used intrinsics or C++. If you use 512-bit vectors in the code, you will need to check whether the CPU supports them or add multiversioning dispatch, or you will get a SIGILL.
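To make that concrete, here is a minimal sketch of runtime dispatch in Rust (the function names are illustrative, not from any particular library):

    // Compiled with AVX2 enabled for this one function only. Calling it
    // on a CPU without AVX2 would be undefined behavior, hence the
    // runtime check in the caller below.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[f32]) -> f32 {
        xs.iter().sum() // the backend may use 256-bit instructions here
    }

    fn sum_scalar(xs: &[f32]) -> f32 {
        xs.iter().sum()
    }

    pub fn sum(xs: &[f32]) -> f32 {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // Safe: the CPU feature was just verified at runtime.
                return unsafe { sum_avx2(xs) };
            }
        }
        sum_scalar(xs)
    }

Dispatching once per call, outside the hot loop, also avoids the indirect jumps mentioned further down.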

I have written both type generic (f32 vs f64) and width generic (f32x4 vs f32x8) SIMD code with Rust std::simd.

And I agree it's not very pretty. I had to resort to having a giant where clause for the generic functions, explicitly enumerating the required std::ops traits. C++ templates don't have this particular issue, and I've used those for the same purpose too.

But even though the implementation of the generic functions is quite ugly indeed, using the functions once implemented is not ugly at all. It's just the "primitive" code that is hairy.
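As a rough illustration (nightly std::simd; the exact module paths have shifted between versions), the generic "primitive" ends up looking something like this, while call sites stay clean:

    #![feature(portable_simd)]
    use std::ops::{Add, Mul};
    use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

    // Generic over both element type (f32 vs f64) and width (x4 vs x8).
    // Every std::ops trait the body needs must be spelled out by hand.
    fn mul_add<T, const N: usize>(
        a: Simd<T, N>,
        b: Simd<T, N>,
        c: Simd<T, N>,
    ) -> Simd<T, N>
    where
        T: SimdElement,
        LaneCount<N>: SupportedLaneCount,
        Simd<T, N>: Mul<Output = Simd<T, N>> + Add<Output = Simd<T, N>>,
    {
        a * b + c
    }

    // Usage reads like ordinary arithmetic code:
    // let r = mul_add(x, y, Simd::<f32, 8>::splat(1.0));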

I think this was a huge missed opportunity in the core language, there should've been a core SIMD type with special type checking rules (when unifying) for this.

However, I still think std::simd is miles better than intrinsics for 98% of the SIMD code I write.

The other 1% (times two, for two instruction sets) is just as bad as it is in any other language with intrinsics.

The native vector width and target-feature multiversioning dispatch are quite hairy. Adding dynamic dispatch in the middle of your hot loops can also have disastrous performance implications, because it tends to kill other optimizations and makes the CPU do indirect jumps.

Have you tried just using the widest possible vector size? e.g. f64x64 or something like it. The compiler can split these into the native vector width of the compile target. This happens at compile time, so it is not suitable if you want to run the same binary on CPUs with different native SIMD widths. I don't have this problem with the hardware I am targeting.
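Something like this sketch (nightly std::simd), where the split into native registers is decided entirely by the compile target:

    #![feature(portable_simd)]
    use std::simd::Simd;

    // One "overly wide" logical vector, split by the backend at compile
    // time into whatever register width the target natively supports.
    type Wide = Simd<f64, 64>;

    fn axpy(a: f64, x: Wide, y: Wide) -> Wide {
        // On an AVX2 target this lowers to a series of 256-bit
        // operations; nothing is dispatched at runtime.
        Simd::splat(a) * x + y
    }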

Rust std::simd docs aren't great and there have been some breaking changes in the few years I've used it. There is certainly more work to be done on that front. But it would be great if at least the basic stuff got stabilized soon.


Replies

dzaima · last Saturday at 12:34 PM

Overly wide vectors, I'd say, are a pretty poor choice in general.

If you're using shuffles at times, you must use native-width vectors to be able to apply them.
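For instance, a lane permutation like the sketch below (std::simd's swizzle, nightly) tends to map to a single native shuffle when the vector fits the hardware width; on an overly wide vector the compiler has to synthesize it from cross-register moves:

    #![feature(portable_simd)]
    use std::simd::{simd_swizzle, Simd};

    // Reverse the lanes of a 4-lane vector: a single native shuffle on
    // targets with 128-bit registers.
    fn reverse4(v: Simd<f32, 4>) -> Simd<f32, 4> {
        simd_swizzle!(v, [3, 2, 1, 0])
    }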

If you're doing early-exit loops, you also want the vector width to be quite small to not do useless work.
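A typical early-exit loop, sketched with nightly std::simd (the cmp module path has moved between nightlies): a hit in the first 16-byte chunk costs only 16 lanes of work, whereas a giant vector would touch far more data before it could bail out.

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialEq, Simd};

    fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
        let needles = Simd::<u8, 16>::splat(needle);
        let mut chunks = haystack.chunks_exact(16);
        for (i, chunk) in chunks.by_ref().enumerate() {
            let hits = Simd::from_slice(chunk).simd_eq(needles);
            if hits.any() {
                // Lane 0 maps to the lowest bit of the bitmask.
                return Some(i * 16 + hits.to_bitmask().trailing_zeros() as usize);
            }
        }
        // Scalar fallback for the tail that didn't fill a full chunk.
        let tail_start = haystack.len() - chunks.remainder().len();
        chunks
            .remainder()
            .iter()
            .position(|&b| b == needle)
            .map(|p| tail_start + p)
    }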

f64x64 is presumably an exaggeration, but an important note is that overly long vectors will overflow the register file and thus cause tons of stack spills. A single f64x64 is 4096 bits, which takes up the entire AVX2 or ARM NEON register file (16 × 256-bit registers and 32 × 128-bit registers respectively)! There's not really much room for a "widest" vector: SSE only has a tiny 2048-bit register file, the equivalent of just four AVX-512 registers, or 1/8th of the AVX-512 register file.

And then there's the major problem that fixed-width vectors work out very badly on scalable vector architectures, i.e. ARM SVE and RISC-V RVV. Of course that's not a big issue if you do a native build or dynamic dispatch, but SVE and RVV are specifically designed so that you do not have to do a native build or duplicate code for different hardware vector widths.

And for things that don't do fancy control flow or use specialized instructions, autovectorization should cover you pretty well anyway; if you have some gathers or many potentially-aliasing memory ranges, on clang and gcc you can add _Pragma("clang loop vectorize(assume_safety)") or _Pragma("GCC ivdep") respectively to tell the compiler to ignore aliasing and vectorize anyway.

MangoToupe · last Saturday at 11:22 AM

> there should've been a core SIMD type with special type checking rules

What does this mean?