> f64x64 is presumably an exaggeration
It's not. IIRC, 64-element vectors are the widest that LLVM (or Rust, I'm not sure which) can work with. It will happily compile code that uses vectors wider than the target CPU supports, and split them accordingly.
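A minimal sketch of what "split accordingly" means, using plain stable Rust arrays rather than the nightly `std::simd` types (where `f64x64` actually lives) — the names here are illustrative, not from the thread:

```rust
// Hypothetical 64-wide "vector" as a plain array; with the nightly
// portable_simd feature this would be std::simd::f64x64 instead.
fn add64(a: &[f64; 64], b: &[f64; 64]) -> [f64; 64] {
    let mut out = [0.0f64; 64];
    for i in 0..64 {
        // LLVM vectorizes this loop and splits the 64-wide operation
        // into however many hardware vectors the target has
        // (e.g. sixteen 256-bit adds on AVX2).
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let a = [1.0f64; 64];
    let b = [2.0f64; 64];
    let c = add64(&a, &b);
    assert!(c.iter().all(|&x| x == 3.0));
    println!("ok");
}
```

The point being that the splitting is free at the source level: the same code compiles for SSE2, AVX2, or AVX-512, just into different numbers of hardware instructions.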
That doesn't necessarily make it a good idea.
Autovectorization works great for simple stuff and has improved a lot in the past decade (e.g. SIMD gather loads).
It doesn't work as well for things like converting a matrix to a quaternion (or vice versa) and then doing that in a loop. But if you write the inner primitive ops with SIMD, you get all the usual compiler optimizations in the outer loop.
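A sketch of that split between inner primitive and outer loop — a Hamilton product on `[x, y, z, w]` arrays as the inner op, then a plain, un-unrolled outer loop for the compiler to optimize (function names are my own, not from the Quake 3 source):

```rust
// Inner primitive: quaternion multiply, written branch-free so LLVM
// can keep the whole computation in vector registers.
fn quat_mul(a: [f64; 4], b: [f64; 4]) -> [f64; 4] {
    let [ax, ay, az, aw] = a;
    let [bx, by, bz, bw] = b;
    [
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    ]
}

// Outer loop: no manual unrolling. The compiler decides how far to
// unroll and how to vectorize across iterations.
fn quat_mul_all(out: &mut [[f64; 4]], a: &[[f64; 4]], b: &[[f64; 4]]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = quat_mul(x, y);
    }
}

fn main() {
    // Multiplying by the identity quaternion [0, 0, 0, 1] is a no-op.
    let q = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(quat_mul([0.0, 0.0, 0.0, 1.0], q), q);

    let mut out = [[0.0; 4]; 2];
    quat_mul_all(&mut out, &[[0.0, 0.0, 0.0, 1.0]; 2], &[q; 2]);
    assert_eq!(out[1], q);
    println!("ok");
}
```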
You should not unroll the outer loop like in the Quake 3 days; the compiler knows better than you how many times it should be unrolled.
I chose this example because I recently ported the Quake 3 quaternion math routines to Rust for a hobby project and re-rolled the unrolled loops. The result was a lot faster than the unrolled original (thanks to LLVM; the same would apply with Clang).
I think LLVM can handle vectors up to just under 65536 bits? At least I saw an LLVM issue noting that that's where some things broke down, so a 32768-bit vector would be f64x512. But I meant it was an exaggeration as far as actually using it goes: f64x64 is hilariously overkill even on AVX-512, and utterly awful pre-AVX-512.