Here's a toy example of Rust std::simd code that is generic over f32 and f64, without macros. It ain't pretty but it works.
https://play.rust-lang.org/?version=nightly&mode=debug&editi...
In my projects, I've put a lot of related SIMD math code in a trait. It saves duplicating the monstrous `where` clause in every function declaration. It also lets me specialize per type: `recip_fast` on f32 uses the fast reciprocal intrinsics (`_mm_rcp_ps` on x86, or `vrecpeq_f32`/`vrecpsq_f32` on ARM), but for f64 it's just `x.recip()` (which uses a division). When compiled for a target other than ARM or x86, it also falls back to `x.recip()` for portability (I haven't actually tested this).
The ergonomics here could be better, but at least it compiles to exactly the assembly instructions I want, and using this code isn't ugly at all. Just `x.recip_fast()` instead of `x.recip()`.
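To make the trait idea concrete, here's a rough sketch of the shape, not my actual code. It assumes stable Rust, so it swaps the std::simd vector types for plain `[f32; 4]` and `[f64; 2]` arrays and calls the `std::arch` intrinsics directly. The trait name `RecipFast` is made up for illustration, and the non-x86/ARM fallback is just as untested here as in the real thing.

```rust
/// Hypothetical trait: one home for the per-type fast reciprocal.
trait RecipFast {
    fn recip_fast(self) -> Self;
}

impl RecipFast for [f32; 4] {
    // x86_64: hardware reciprocal estimate (about 12 bits of precision).
    #[cfg(target_arch = "x86_64")]
    fn recip_fast(self) -> Self {
        use std::arch::x86_64::{_mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps};
        let mut out = [0.0f32; 4];
        // SAFETY: SSE is part of the x86_64 baseline, and both pointers
        // refer to exactly four f32 values.
        unsafe {
            let v = _mm_loadu_ps(self.as_ptr());
            _mm_storeu_ps(out.as_mut_ptr(), _mm_rcp_ps(v));
        }
        out
    }

    // aarch64: estimate plus one Newton-Raphson refinement step.
    #[cfg(target_arch = "aarch64")]
    fn recip_fast(self) -> Self {
        use std::arch::aarch64::{vld1q_f32, vmulq_f32, vrecpeq_f32, vrecpsq_f32, vst1q_f32};
        let mut out = [0.0f32; 4];
        // SAFETY: NEON is part of the aarch64 baseline, and both pointers
        // refer to exactly four f32 values.
        unsafe {
            let v = vld1q_f32(self.as_ptr());
            let est = vrecpeq_f32(v);
            // vrecpsq_f32(v, est) computes 2 - v * est.
            let refined = vmulq_f32(est, vrecpsq_f32(v, est));
            vst1q_f32(out.as_mut_ptr(), refined);
        }
        out
    }

    // Any other target: plain division per lane.
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    fn recip_fast(self) -> Self {
        self.map(f32::recip)
    }
}

impl RecipFast for [f64; 2] {
    // No specialization for f64 here: just divide.
    fn recip_fast(self) -> Self {
        self.map(f64::recip)
    }
}
```

Call sites then look the same for both types, e.g. `[1.0f32, 2.0, 4.0, 8.0].recip_fast()` or `[2.0f64, 4.0].recip_fast()`. Keep in mind that the f32 result is only approximate on x86 and ARM, so compare against a tolerance rather than for equality.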