jandrewrogers • today at 3:18 AM • 6 replies

I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is pretty poor outside of narrow cases because it doesn't (and mostly can't) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.

I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?

The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.

Replies

mgaunard • today at 3:31 AM

For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable and unreliable, and which is by design always behind.

cortesoft • today at 4:10 AM

Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?

mpyne • today at 3:52 AM

> I think a legitimate criticism is that it is unclear who std::simd is for.

I think it's for people like me, who recognize that, depending on the dataset, a lot of performance is left on the table when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.

Having a way to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to the performance of optimal intrinsics, would be great.

Just like I'd rather use a range-based for loop than hand-count an index against a size.
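
As a rough illustration of the task described above (flag the bytes that match one of five characters, then count the flags), here is a minimal scalar sketch in C++. It does not use std::simd or intrinsics. The function name `count_delims` and the particular five characters are assumptions made for the example, not details from the original wc(1) exercise. The inner loop ORs the five comparisons into a 0/1 flag, and summing those flags stands in for the popcount step. Whether a given compiler turns this loop into SIMD code depends on the compiler and its flags.

```cpp
#include <cstddef>

// Hypothetical helper: counts the bytes in [data, data + n) that equal
// any of five delimiter characters. The delimiter set is an assumption
// chosen for illustration only.
inline std::size_t count_delims(const char* data, std::size_t n) {
    constexpr char delims[5] = {' ', '\n', '\t', '\r', '\f'};
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // OR the five equality tests together into a single 0/1 flag,
        // with no early exit, so every byte does the same work.
        unsigned flag = 0;
        for (char d : delims) {
            flag |= static_cast<unsigned>(data[i] == d);
        }
        // Adding the flags up is the scalar counterpart of a popcount
        // over a vector mask.
        total += flag;
    }
    return total;
}
```

Calling `count_delims(buf, len)` on a buffer returns how many of its bytes are in the delimiter set; a word counter would build on this by looking at transitions between flagged and unflagged bytes.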

> People that don’t use SIMD today are unlikely to use std::simd tomorrow.

I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly, despite advancements in glibc and binutils that make it easier to load CPU-specific code. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
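
One way to picture the run-time selection of CPU-specific code mentioned above is manual dispatch with compiler builtins. This is only a sketch, not the glibc/binutils mechanism itself. It assumes GCC or Clang on x86-64 Linux, where the `target("avx2")` function attribute and `__builtin_cpu_supports` are available, and it falls back to a single baseline build everywhere else. The function names (`sum_bytes` and its two variants) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline version, compiled for the default target.
static std::uint64_t sum_bytes_baseline(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if defined(__linux__) && defined(__x86_64__) && defined(__GNUC__)
#define HAVE_AVX2_VARIANT 1
// Same loop, but the compiler is allowed to emit AVX2 instructions
// inside this one function (assumption: GCC/Clang target attribute).
__attribute__((target("avx2")))
static std::uint64_t sum_bytes_avx2(const std::uint8_t* p, std::size_t n) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
#endif

// Picks an implementation at run time based on what the CPU reports.
std::uint64_t sum_bytes(const std::uint8_t* p, std::size_t n) {
#ifdef HAVE_AVX2_VARIANT
    if (__builtin_cpu_supports("avx2")) return sum_bytes_avx2(p, n);
#endif
    return sum_bytes_baseline(p, n);
}
```

Both variants compute the same result; only the instructions the compiler may use differ. Even this small example needs per-platform guards, which is part of the friction the comment describes.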

And even gaining 60-70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.

In the end I did have to write some direct SIMD intrinsics (I forget what issue I'd run into starting off with std::simd), but std::simd was what had made that problem seem approachable for the first time.

synergy20 • today at 4:40 AM

What about Google's Highway project?

paulddraper • today at 4:08 AM

> I think a legitimate criticism is that it is unclear who std::simd is for

It's for people that don't use SIMD today.

SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.

---

Despite the title, the primary criticism of the article is that compilers' auto-vectorizers have improved more than the currently shipped stdlib version.
