dzaima • last Saturday at 12:34 PM

Overly-wide vectors I'd say are a pretty poor choice in general.

If you use shuffles at all, you have to use native-width vectors to be able to apply them.

If you're doing early-exit loops, you also want the vector width to be quite small so you don't do useless work past the exit point.

f64x64 is presumably an exaggeration, but an important note is that overly long vectors will overflow the register file and thus cause tons of stack spills. A single f64x64 (4096 bits) takes up the entire AVX2 or ARM NEON (AArch64) register file! There's not really much room for a "widest" vector: SSE only has a tiny 2048-bit register file, the equivalent of just four AVX-512 registers, or 1/8th of the AVX-512 register file.
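To make that register-budget arithmetic concrete, here's a rough C sketch. The register counts are the architectural ones as I understand them (16 XMM/YMM registers on x86-64, 32 ZMM registers with AVX-512, 32 128-bit registers on AArch64 NEON), and the helper names are made up for illustration.

```c
#include <assert.h>

/* Size of an architectural vector register file in bits.
   Hypothetical helper, not part of any library. */
unsigned regfile_bits(unsigned num_regs, unsigned bits_per_reg) {
    assert(num_regs > 0 && bits_per_reg > 0);
    return num_regs * bits_per_reg;
}

/* One f64x64 value: 64 lanes of 64-bit doubles. */
unsigned f64x64_bits(void) {
    return 64u * 64u;
}

/* How many values of a given width fit in a register file before
   anything has to spill (ignoring registers needed for other work). */
unsigned values_that_fit(unsigned file_bits, unsigned value_bits) {
    assert(value_bits > 0);
    return file_bits / value_bits;
}
```

By this count, values_that_fit(regfile_bits(16, 256), f64x64_bits()) is 1 for AVX2, and even the AVX-512 file (32 registers of 512 bits) only holds four f64x64 values with nothing left over for temporaries.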

And then there's the major problem that fixed-width vectors work out very badly on scalable vector architectures, i.e. ARM SVE and RISC-V RVV. That's not a big issue if you do a native build or dynamic dispatch, but SVE and RVV are specifically designed so that you need neither a native build nor duplicated code for different hardware vector widths.

And for things that don't do fancy control flow or use specialized instructions, autovectorization should cover you pretty well anyway. If you have some gathers or many potentially-aliasing memory ranges, on clang & gcc you can put _Pragma("clang loop vectorize(assume_safety)") _Pragma("GCC ivdep") right before the loop to tell the compiler to ignore aliasing and vectorize anyway.
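A minimal sketch of where those pragmas go, assuming a made-up gather-style kernel (the function name and signature are hypothetical). I've picked one pragma per compiler with #if here; clang also defines __GNUC__, so it has to be checked first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out[i] += table[idx[i]].
   The compiler cannot prove that out does not overlap table or idx,
   so it may refuse to vectorize without a hint. The pragma is a promise
   from the caller that there are no loop-carried dependencies; if the
   ranges do overlap, the result is on you. */
void gather_add(float *out, const float *table, const int *idx, size_t n) {
    assert(out != NULL && table != NULL && idx != NULL);
#if defined(__clang__)
    _Pragma("clang loop vectorize(assume_safety)")
#elif defined(__GNUC__)
    _Pragma("GCC ivdep")
#endif
    for (size_t i = 0; i < n; i++) {
        out[i] += table[idx[i]];
    }
}
```

Whether the loop actually gets vectorized still depends on the target and optimization level; the pragma only removes the aliasing objection.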

Replies

exDM69 • last Saturday at 1:02 PM

> f64x64 is presumably an exaggeration

It's not. IIRC, 64-element vectors are the widest that LLVM (or Rust, not sure) can work with. It will happily compile code that uses wider vectors than the target CPU has and split them accordingly.
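For a rough C analogue of this (not the Rust API being discussed), GCC and Clang have a vector_size extension that accepts types wider than any hardware register, and the compiler lowers the operations into however many native-width pieces the target needs. The type and function names below are assumptions for illustration.

```c
#include <assert.h>

/* GCC/Clang vector extension: one value holding 64 doubles (4096 bits).
   The name f64x64 mirrors the thread; it is not a standard C type. */
typedef double f64x64 __attribute__((vector_size(64 * sizeof(double))));

/* acc += a * b, lane by lane. Arguments are taken by pointer to avoid
   passing a 4096-bit value across a function-call boundary. */
void mul_add_f64x64(f64x64 *acc, const f64x64 *a, const f64x64 *b) {
    assert(acc && a && b);
    *acc += *a * *b;
}
```

It compiles, but with three such values live at once on a target like AVX2, most of the data lives on the stack, which is the spilling problem described above.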

That doesn't necessarily make it a good idea.

Autovectorization works great for simple stuff and has improved a lot in the past decade (e.g. SIMD gather loads).

It doesn't work great for things like converting a matrix to a quaternion (or vice versa), and then doing that in a loop. But if you write the inner primitive ops with SIMD, you get all the usual compiler optimizations in the outer loop.

You should not unroll the outer loop like in the Quake 3 days. The compiler knows better how many times it should be unrolled.
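Roughly what that structure looks like, sketched in C with the GCC/Clang vector extension instead of Rust. The primitive here is a plain quaternion lerp, chosen because it is short; it is a stand-in and not the matrix/quaternion conversion mentioned above, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Native-width 4 x f32 vector (GCC/Clang extension), one quaternion each. */
typedef float f32x4 __attribute__((vector_size(16)));

/* Inner primitive written with explicit SIMD: a + (b - a) * t per lane. */
static inline f32x4 quat_lerp(f32x4 a, f32x4 b, float t) {
    f32x4 tt = {t, t, t, t};  /* broadcast t to all four lanes */
    return a + (b - a) * tt;
}

/* Outer loop left as a plain loop, with no manual unrolling.
   The compiler can inline the primitive and pick its own unroll factor. */
void quat_lerp_many(f32x4 *out, const f32x4 *a, const f32x4 *b,
                    float t, size_t n) {
    assert(out && a && b);
    for (size_t i = 0; i < n; i++) {
        out[i] = quat_lerp(a[i], b[i], t);
    }
}
```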

I chose this example because I recently ported the Quake 3 quaternion math routines to Rust for a hobby project and un-unrolled the loops. It was a lot faster than the unrolled original (thanks to LLVM, same would apply to Clang).
