Hacker News

exDM69 last Saturday at 2:00 PM

> if you don't describe your code and dataflow in a way that caters to the shape of the SIMD

But when I do describe code, dataflow and memory layout in a SIMD friendly way it's pretty much the same for x86_64 and ARM.

Then I can just use `a + b` and `f32x4` (or its C equivalent) instead of `_mm_add_ps` and `__m128` (x86_64) or `vaddq_f32` and `float32x4_t` (ARM).

Portable SIMD means I don't need to write this code twice and memorize arcane runes for basic arithmetic operations.

For more specialized stuff you have intrinsics.
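A minimal sketch of the portable style described above, using the GNU/Clang `vector_size` extension (the type and function names here are hypothetical, chosen to mirror the comment): the same `a + b` compiles to an SSE add on x86_64 and a NEON add on ARM, with no per-platform intrinsics.

```c
#include <assert.h>

/* A portable 4-lane float vector via the GNU/Clang vector extension.
 * On x86_64 this lowers to an SSE register, on AArch64 to a NEON one. */
typedef float f32x4 __attribute__((vector_size(16)));

/* Plain arithmetic instead of _mm_add_ps (x86_64) or vaddq_f32 (ARM). */
static inline f32x4 add4(f32x4 a, f32x4 b) {
    return a + b;
}
```

Lanes of such a vector can be initialized and read with ordinary brace/subscript syntax, so no load/store intrinsics are needed for basic use either.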


Replies

owlbite last Saturday at 2:26 PM

So we write a lot of code in this agnostic fashion using typedefs and clang's vector attribute support, along with __builtin_shufflevector for all the permutations (along similar lines to Apple's simd.h). It works pretty well in terms of not needing to memorize/look up all the mnemonic intrinsics for a given platform, and it lets regular arithmetic operators work.

However, we still end up writing different code for different target SoCs, as the microarchitectures differ and we want to maximize throughput and take advantage of any dedicated instructions or type support the ISA offers.

One big challenge is targeting in-order cores: the compiler often does a terrible job of register allocation (we need to use pretty much all the architectural registers to cover vector instruction latencies), so the model breaks down somewhat there and we have to drop to inline assembly.
