Hacker News

owlbite last Saturday at 2:26 PM

So we write a lot of code in this platform-agnostic fashion using typedefs and clang's vector attribute support, along with __builtin_shufflevector for all the permutations (something along similar lines to Apple's simd.h). It works pretty well in terms of not needing to memorize/look up all the mnemonic intrinsics for a given platform, and it lets regular arithmetic operators work on vector values.
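
For readers who haven't seen this style, here's a minimal sketch of what it can look like (the float4 typedef and the cross3 kernel are invented for illustration, not the poster's actual code; this relies on clang's vector extensions):

    /* 4-lane float vector via clang/GCC vector extensions. */
    typedef float float4 __attribute__((vector_size(16)));

    /* 3D cross product in the xyz lanes; lane 3 is ignored. */
    static inline float4 cross3(float4 a, float4 b) {
        /* yzx permutations via __builtin_shufflevector */
        float4 a_yzx = __builtin_shufflevector(a, a, 1, 2, 0, 3);
        float4 b_yzx = __builtin_shufflevector(b, b, 1, 2, 0, 3);
        /* regular arithmetic operators work directly on vector types */
        float4 c = a * b_yzx - a_yzx * b;
        return __builtin_shufflevector(c, c, 1, 2, 0, 3);
    }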

However, we still end up writing different code for different target SoCs, as the microarchitectures differ and we want to maximize our throughput and take advantage of any ISA support for dedicated instructions or type support.
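
As a hypothetical illustration of the type-support side of that, the same kernel can pick up native fp16 vector arithmetic only where the target advertises it (the vec8/axpy8 names are made up here; the ACLE macro is real):

    #if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    typedef _Float16 elem_t;
    typedef _Float16 vec8 __attribute__((vector_size(16)));  /* 8 x fp16 */
    #else
    typedef float elem_t;
    typedef float vec8 __attribute__((vector_size(32)));     /* 8 x fp32 */
    #endif

    static inline vec8 axpy8(elem_t a, vec8 x, vec8 y) {
        vec8 va = {a, a, a, a, a, a, a, a};  /* broadcast the scalar */
        return va * x + y;
    }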

One big challenge is targeting in-order cores, where the compiler often does a terrible job of register allocation (we need to use pretty much all the architectural registers to cover vector instruction latencies), so we find the model breaks down somewhat there and we have to drop to inline assembly.
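
A rough sketch of that escape hatch on AArch64 (this toy fma4 wrapper is invented for the comment; it still lets the compiler choose which vector registers to use, which is why kernels that need every accumulator pinned tend to become whole inline-asm loops or .S files instead):

    typedef float float4 __attribute__((vector_size(16)));

    static inline float4 fma4(float4 acc, float4 a, float4 b) {
    #if defined(__aarch64__)
        /* fused multiply-add on the SIMD registers ("w" constraint) */
        __asm__("fmla %0.4s, %1.4s, %2.4s"
                : "+w"(acc)
                : "w"(a), "w"(b));
        return acc;
    #else
        return acc + a * b;
    #endif
    }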


Replies

exDM69 last Saturday at 2:50 PM

Your experience matches mine: you can get a lot done with the portable SIMD in Clang/GCC/Rust, but you can't avoid the platform-specific stuff when you need specialized instructions.

How much you need to resort to platform-specific intrinsics depends on the domain you work in. For me, dabbling in computer graphics and game physics, almost all of the code is portable except for the occasional specialized instruction here and there.
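
As a made-up example of that kind of rare exception: a reciprocal square root helper that uses the dedicated estimate instructions where they exist and a plain scalar fallback otherwise (rsqrt4 and the float4 typedef are just names chosen for this sketch):

    #include <math.h>
    #if defined(__ARM_NEON)
      #include <arm_neon.h>
    #elif defined(__SSE__)
      #include <xmmintrin.h>
    #endif

    typedef float float4 __attribute__((vector_size(16)));

    /* Approximate 1/sqrt(x) on 4 lanes. */
    static inline float4 rsqrt4(float4 v) {
    #if defined(__ARM_NEON)
        float32x4_t e = vrsqrteq_f32((float32x4_t)v);
        /* one Newton-Raphson refinement using the dedicated step instruction */
        e = vmulq_f32(e, vrsqrtsq_f32((float32x4_t)v, vmulq_f32(e, e)));
        return (float4)e;
    #elif defined(__SSE__)
        return (float4)_mm_rsqrt_ps((__m128)v);
    #else
        return (float4){ 1.0f / sqrtf(v[0]), 1.0f / sqrtf(v[1]),
                         1.0f / sqrtf(v[2]), 1.0f / sqrtf(v[3]) };
    #endif
    }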

For someone working in specialized domains (like video codecs) or on specialized hardware (HPC supercomputers), the balance might be the other way around.