Hacker News

jandrewrogers · today at 3:48 AM

For some algorithms you have to compromise the data layout to stay compatible across the widest range of microarchitectures, which nerfs performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap between those two implementations can be vast.
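A minimal sketch of the tension (illustrative names, not from any particular library): the array-of-structs layout is what portable code tends to assume, while struct-of-arrays is what wide SIMD units want, because each field becomes a contiguous run the compiler can load full vectors from.

```cpp
#include <array>
#include <cstddef>

// AoS: the "portable" layout most code expects; fields are interleaved,
// so vectorizing a per-field operation requires strided/gathered loads.
struct ParticleAoS {
    float x, y, z;
};

// SoA: each field is contiguous, so a kernel over one field maps
// directly onto full-width vector loads and stores.
template <std::size_t N>
struct ParticlesSoA {
    std::array<float, N> x, y, z;
};

// On the SoA layout this loop is trivially auto-vectorizable;
// the equivalent loop over ParticleAoS[] is not, without shuffles.
template <std::size_t N>
void scale(ParticlesSoA<N>& p, float s) {
    for (std::size_t i = 0; i < N; ++i) p.x[i] *= s;
    for (std::size_t i = 0; i < N; ++i) p.y[i] *= s;
    for (std::size_t i = 0; i < N; ++i) p.z[i] *= s;
}
```

The "vast gap" shows up because choosing one layout for portability forecloses the other: code written against AoS cannot be rescued by a clever SIMD wrapper after the fact.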

In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And the people using SIMD are the people who care about state-of-the-art performance, so portability takes a distant back seat.


Replies

camel-cdr · today at 5:30 AM

The data layout can often be done dynamically based on your target architecture.
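One common way to do that selection (a sketch under the assumption that each translation unit is built with ISA-specific flags; the macros are the standard compiler-defined ones): derive the SoA chunk width from the target ISA at compile time, so the layout matches the native vector width of whatever the code was compiled for.

```cpp
#include <cstddef>

// Pick the number of float lanes from the target ISA. These macros are
// predefined by GCC/Clang/MSVC when the corresponding ISA is enabled.
#if defined(__AVX512F__)
constexpr std::size_t kLanes = 16;   // 512-bit vectors / 32-bit float
#elif defined(__AVX2__)
constexpr std::size_t kLanes = 8;    // 256-bit vectors
#elif defined(__SSE2__) || defined(__ARM_NEON)
constexpr std::size_t kLanes = 4;    // 128-bit vectors
#else
constexpr std::size_t kLanes = 1;    // scalar fallback
#endif

// Data stored in lane-sized blocks: kernels written against kLanes get
// a layout that is optimal for the ISA this TU was compiled for.
struct Block {
    float v[kLanes];
};
```

The limitation, which is the parent's point, is that this only works when you control the layout per build target; data shared across binaries compiled for different ISAs still has to settle on one layout.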

mgaunard · today at 4:39 AM

For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.

That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
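The dispatch idea can be sketched with tag inheritance (a toy illustration of the technique, not Boost.SIMD's actual code): each ISA gets a tag deriving from the next-less-capable one, and overload resolution automatically slides down the chain to the most specialized implementation that was actually provided.

```cpp
// ISA tags form an inheritance chain: more capable derives from less.
struct scalar_tag {};
struct sse_tag  : scalar_tag {};
struct avx_tag  : sse_tag {};

// Generic fallback, always present. The return value here is just the
// lane width of the chosen path, to make dispatch observable.
inline int impl(const float*, int, scalar_tag) { return 1; }

// A more specialized version, injected when the ISA supports it.
inline int impl(const float*, int, sse_tag)    { return 4; }

// No avx_tag overload exists, so an avx_tag argument binds to the
// sse_tag overload via derived-to-base conversion: best available wins.

using current_isa = sse_tag;   // assumption: selected by build flags
inline int run(const float* p, int n) {
    return impl(p, n, current_isa{});
}
```

The ODR hazard the comment mentions comes from the same function name having different definitions in TUs compiled for different ISAs, which is why forceinline was used to keep each definition local.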

We also focused on higher-level constructs that capture intent, rather than abstracting over features at too low a level; some features were explicitly provided as kernels or algorithms instead of plain vector operations.
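The intent-capturing approach can be sketched like this (illustrative names, not Eve's actual API): the user states *what* to compute over a whole range, and the library keeps the freedom to choose chunking, unrolling, and ISA internally, instead of exposing raw vector registers.

```cpp
#include <cstddef>

// An algorithm-level entry point: the caller expresses intent
// ("apply op to every element"), not vector mechanics.
template <class T, class UnaryOp>
void simd_transform(const T* in, T* out, std::size_t n, UnaryOp op) {
    // A real implementation would process full vector-width chunks here
    // and fall back to scalar code only for the tail. The scalar loop
    // below keeps this sketch self-contained.
    for (std::size_t i = 0; i < n; ++i) out[i] = op(in[i]);
}
```

Because the loop structure belongs to the library rather than the caller, the same call site can compile to different vector strategies per target without the user rewriting anything.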

mattip · today at 4:25 AM

NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
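The runtime side of such a scheme looks roughly like this (a generic function-pointer sketch of the technique, not NumPy's actual dispatch code): one resolver per kernel picks an implementation from CPU feature detection at startup, and every microarchitecture supported means another compiled copy of every kernel shipped in the binary, which is exactly the bloat being described.

```cpp
// Two copies of the same kernel. In a real build, add_avx2 would live
// in its own translation unit compiled with AVX2 enabled.
void add_scalar(const float* a, const float* b, float* o, int n) {
    for (int i = 0; i < n; ++i) o[i] = a[i] + b[i];
}
void add_avx2(const float* a, const float* b, float* o, int n) {
    for (int i = 0; i < n; ++i) o[i] = a[i] + b[i];
}

// Stub for this sketch; real code would query cpuid (x86) or
// getauxval (Linux/ARM) once at startup.
bool cpu_has_avx2() { return false; }

using add_fn = void (*)(const float*, const float*, float*, int);

// Resolved once; every dispatched kernel needs one of these, and every
// extra microarchitecture multiplies the number of kernel bodies.
add_fn resolve_add() {
    return cpu_has_avx2() ? add_avx2 : add_scalar;
}
```

The binary-size cost scales as (number of kernels) × (number of dispatch targets), which is why the practical question is how few targets you can get away with.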