Hacker News

the__alchemist, last Saturday at 12:55 PM

I've been dodging the f32/f64-specificity in Rust using macros, but I don't love it. I do it because I'm not sure what else to do.

I think `core::simd` is probably not close to stabilizing, but 512-bit AVX SIMD will be out in a month or two! So you could use f64x8 and f32x16.

I've built my own f32xY, Vec3xY, etc. structs, since I want to use stable Rust but don't want to wait for `core::simd`. These all have syntax that mimics `core::simd`. I set up pack/unpack functions too, but they are still a hassle compared to non-SIMD operations.
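To make that concrete, here is a minimal sketch of what such a hand-rolled wrapper could look like on stable Rust. The names `F32x8`, `pack`, and `unpack` are hypothetical, not the actual code described above, and the lane-wise loops rely on the compiler to vectorize rather than calling intrinsics directly.

```rust
use core::ops::{Add, Mul};

/// Hypothetical 8-lane f32 wrapper; the name echoes `core::simd::f32x8`.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(align(32))]
pub struct F32x8(pub [f32; 8]);

impl F32x8 {
    pub const LANES: usize = 8;

    /// Broadcast one scalar to every lane.
    pub fn splat(v: f32) -> Self {
        Self([v; 8])
    }

    pub fn to_array(self) -> [f32; 8] {
        self.0
    }
}

impl Add for F32x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o += *r;
        }
        Self(out)
    }
}

impl Mul for F32x8 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
            *o *= *r;
        }
        Self(out)
    }
}

/// Pack a scalar slice into whole vectors, zero-padding the last one.
pub fn pack(xs: &[f32]) -> Vec<F32x8> {
    xs.chunks(F32x8::LANES)
        .map(|c| {
            let mut lanes = [0.0f32; 8];
            lanes[..c.len()].copy_from_slice(c);
            F32x8(lanes)
        })
        .collect()
}

/// Unpack vectors back into `len` scalars, dropping the padding.
pub fn unpack(vs: &[F32x8], len: usize) -> Vec<f32> {
    vs.iter().flat_map(|v| v.0).take(len).collect()
}
```

The hassle shows up in the bookkeeping: every call site has to carry the original length around so the padding can be dropped again after `unpack`.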

This raises the question: if `core::simd` doesn't end up ideal (for flexibility, etc.), how far can we get with a thin-wrapper lib? The ideal API (imo) is transparent, and supports the widest instructions available on your architecture, falling back to smaller ones or non-SIMD.
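As a rough illustration of the "widest available, then fall back" idea, here is a sketch that picks a lane count at runtime and dispatches to a width-generic kernel. The functions `widest_f32_lanes`, `sum_chunked`, and `sum` are hypothetical names. This only shows the dispatch shape: the kernels are plain loops, so whether wide instructions are actually emitted is up to the compiler, and a real wrapper would presumably put intrinsics or `#[target_feature]` functions behind each arm.

```rust
/// Widest f32 lane count worth targeting on this machine (x86_64 only here).
#[cfg(target_arch = "x86_64")]
pub fn widest_f32_lanes() -> usize {
    if std::is_x86_feature_detected!("avx512f") {
        16
    } else if std::is_x86_feature_detected!("avx2") {
        8
    } else {
        4 // SSE2 is part of the x86_64 baseline
    }
}

/// Assumption for this sketch: every other architecture takes the scalar path.
#[cfg(not(target_arch = "x86_64"))]
pub fn widest_f32_lanes() -> usize {
    1
}

/// Width-generic kernel: N running sums, then a horizontal add plus the tail.
fn sum_chunked<const N: usize>(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; N];
    let chunks = xs.chunks_exact(N);
    let tail = chunks.remainder();
    for c in chunks {
        for i in 0..N {
            acc[i] += c[i];
        }
    }
    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}

/// The "transparent" entry point: callers never see the width.
pub fn sum(xs: &[f32]) -> f32 {
    match widest_f32_lanes() {
        16 => sum_chunked::<16>(xs),
        8 => sum_chunked::<8>(xs),
        4 => sum_chunked::<4>(xs),
        _ => xs.iter().sum(),
    }
}
```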

It also raises the question of whether you should stop worrying and love the macro... We are talking f32/f64. Different widths. Different architectures. And now conflated with len 3 vecs, len 4 vecs, etc. (where each vector/tensor item is a SIMD intrinsic). How does every vector library (not in the SIMD sense; the most popular one uses AoS SIMD, which is IMO a mistake) handle this? Macros. This leads me to think we macro this too.
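For reference, the macro route can be sketched roughly like this: one `macro_rules!` definition stamps out a wrapper per element type and width. The macro name `simd_wrapper` and the generated type names are made up for illustration, and only `splat` and `Mul` are generated to keep it short.

```rust
/// Hypothetical macro: generates a lane-wise wrapper for one (type, width) pair.
macro_rules! simd_wrapper {
    ($name:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        pub struct $name(pub [$elem; $lanes]);

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one scalar to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }
        }

        impl core::ops::Mul for $name {
            type Output = Self;
            fn mul(self, rhs: Self) -> Self {
                let mut out = self.0;
                for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
                    *o *= *r;
                }
                Self(out)
            }
        }
    };
}

// One line per combination instead of one hand-written impl block each.
simd_wrapper!(F32x8, f32, 8);
simd_wrapper!(F32x16, f32, 16);
simd_wrapper!(F64x4, f64, 4);
simd_wrapper!(F64x8, f64, 8);
```

The same trick extends to the len 3 / len 4 vector types by passing the lane type as another macro argument, which is where the combinatorics start to hurt without macros.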


Replies

exDM69, last Saturday at 1:51 PM

Here's a toy example of f32 vs. f64 generic Rust std::simd code without macros. It ain't pretty but it works.

https://play.rust-lang.org/?version=nightly&mode=debug&editi...

In my projects, I've put a lot of related SIMD math code in a trait. It saves duplicating the monstrous `where` clause in every function declaration. Additionally, it allows me to specialize so that `recip_fast` on f32 uses the fast reciprocal intrinsics (`_mm_rcp_ps` on x86, or `vrecpeq_f32`/`vrecpsq_f32` on ARM), but for f64 it's just `x.recip()` (which uses a division). If compiled for anything other than ARM or x86, it also falls back to `x.recip()` for portability (I haven't actually tested this).
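A stripped-down sketch of that trait pattern on stable Rust, using plain arrays in place of `std::simd` vectors. The trait name `SimdMath` and the array-based impls are assumptions for illustration, not the code from the playground link, and the ARM path is left out so that everything non-x86_64 takes the portable fallback.

```rust
/// Hypothetical trait collecting SIMD math helpers in one place.
pub trait SimdMath: Sized {
    /// Reciprocal that may trade accuracy for speed.
    fn recip_fast(self) -> Self;
}

impl SimdMath for [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    fn recip_fast(self) -> Self {
        use core::arch::x86_64::{_mm_loadu_ps, _mm_rcp_ps, _mm_storeu_ps};
        let mut out = [0.0f32; 4];
        // SAFETY: SSE is part of the x86_64 baseline, and both pointers
        // refer to four valid f32 values (unaligned load/store is allowed).
        unsafe {
            let v = _mm_loadu_ps(self.as_ptr());
            _mm_storeu_ps(out.as_mut_ptr(), _mm_rcp_ps(v));
        }
        out
    }

    /// Portable fallback: exact reciprocal via division.
    #[cfg(not(target_arch = "x86_64"))]
    fn recip_fast(self) -> Self {
        self.map(f32::recip)
    }
}

impl SimdMath for [f64; 2] {
    /// No fast approximation used for f64 here; just divide.
    fn recip_fast(self) -> Self {
        self.map(f64::recip)
    }
}

/// Generic caller: the bound is a single trait instead of a long `where` clause.
pub fn recip_all<T: SimdMath + Copy>(xs: &mut [T]) {
    for x in xs.iter_mut() {
        *x = (*x).recip_fast();
    }
}
```

Note that `_mm_rcp_ps` is only an approximation (about 12 bits of precision), so results on the f32 path should be compared with a tolerance rather than exactly.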

The ergonomics here could be better, but at least it compiles to exactly the assembly instructions I want, and using this code isn't ugly at all. Just `x.recip_fast()` instead of `x.recip()`.