hajile • today at 5:59 PM • 1 reply

Wider SIMD is a solution in search of a problem in most cases.

If your code can go wide and has few branches (uses SIMD basically every cycle), either a GPU or matrix co-processor will handily beat the performance of several CPU cores all running together.

If your code can go wide but is branchy (uses bursts of SIMD between branches), wider becomes even less worth it. Say it takes 4 cycles to put through a 256-bit SIMD instruction and there are some branches before the next one. Splitting it into two 128-bit instructions will either execute both in parallel in the same 4 cycles or, in the worst case, pipeline them to 5 cycles (that's just a single instruction bubble in the FPU pipeline).

You can increase this differential by going to a 512-bit pipeline, but if it's just occasional 512-bit, you can still match it with 4 SIMD units (the latest couple of ARM cores have 6 SIMD units). Even in the worst case, pipelining out from 4 to 7 cycles means you only need bubbles of at least 3 cycles to break even, which still doesn't seem too unusual.
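A rough way to sanity-check those cycle counts is a toy model. This is only a sketch under assumptions: the 4-cycle latency is the figure used above, each SIMD unit is assumed to be fully pipelined (one new instruction per unit per cycle), and `burst_cycles` is a made-up helper name, not anything from a real tool.

```python
import math


def burst_cycles(num_ops: int, num_units: int, latency: int = 4) -> int:
    """Cycles until the last instruction of a SIMD burst completes.

    Assumes every unit is fully pipelined and accepts one new
    instruction per cycle, so the burst issues in waves of `num_units`.
    """
    issue_waves = math.ceil(num_ops / num_units)
    return latency + issue_waves - 1


# One native 256-bit or 512-bit instruction on a wide unit.
wide = burst_cycles(num_ops=1, num_units=1)

# 256 bits of work split into two 128-bit instructions.
split_256_parallel = burst_cycles(num_ops=2, num_units=2)
split_256_serial = burst_cycles(num_ops=2, num_units=1)

# 512 bits of work split into four 128-bit instructions.
split_512_parallel = burst_cycles(num_ops=4, num_units=4)
split_512_serial = burst_cycles(num_ops=4, num_units=1)

# Extra cycles the branchy code between bursts has to hide
# before the narrow design breaks even in the worst case.
bubble_needed = split_512_serial - wide

print(wide, split_256_parallel, split_256_serial)
print(split_512_parallel, split_512_serial, bubble_needed)
```

With enough 128-bit units the split version finishes in the same 4 cycles; with only one unit available it stretches to 5 cycles for 256 bits and 7 cycles for 512 bits, which is where the 3-cycle break-even bubble comes from.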

The one area where this seems to be potentially untrue is simulations working with loads of f64 numbers, which can consistently achieve high density with code just branchy enough to make GPUs inefficient. Most of these workloads run on supercomputers though, and the ARM competitor here is the Fujitsu A64FX, which does have 512-bit SVE.

It's also worth noting that even modern x86 chips (by both AMD and Intel) seem to throttle under heavy 512-bit multi-core workloads. Reducing the clock speed in turn reduces integer performance, which may make applications slower in some cases.
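To put rough numbers on that trade-off, here is a back-of-the-envelope model. Everything in it is an assumption for illustration: the 2x vector speedup, the 15% clock drop, and the workload mixes are made-up figures, not measurements from any chip, and `relative_runtime` is a hypothetical helper.

```python
def relative_runtime(vector_fraction: float,
                     vector_speedup: float,
                     clock_ratio: float) -> float:
    """Runtime relative to the narrower-SIMD baseline (1.0 = same speed).

    vector_fraction: share of baseline runtime spent in SIMD code.
    vector_speedup:  speedup of that share from wider vectors.
    clock_ratio:     throttled clock divided by baseline clock,
                     applied to the whole program (scalar code included).
    """
    scalar_part = 1.0 - vector_fraction
    vector_part = vector_fraction / vector_speedup
    return (scalar_part + vector_part) / clock_ratio


# Hypothetical numbers: 2x faster vector code, clock throttled to 85%.
mostly_scalar = relative_runtime(0.2, 2.0, 0.85)
mostly_vector = relative_runtime(0.6, 2.0, 0.85)

print(f"20% SIMD: {mostly_scalar:.2f}x baseline runtime")
print(f"60% SIMD: {mostly_vector:.2f}x baseline runtime")
```

Under these assumed numbers the mostly-scalar program ends up slower than it was without the wide vectors (relative runtime above 1.0), while the SIMD-heavy one still comes out ahead.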

All of this is why ARM/Qualcomm/Apple's chips with 128-bit SIMD and a couple of AMX/SME units are very competitive in most workloads even though they seem significantly worse on paper.

Replies

dlcarrier • today at 6:46 PM

Video encoding and image compression are a huge use case, and not at all uncommon, so much so that a lot of hardware has dedicated accelerators for them. Of course, offloading the work to dedicated hardware accelerators does reduce usage of SIMD instructions, but any time a specific codec or algorithm isn't accelerated, the SIMD instructions are absolutely necessary.

Emulators also use them a lot, often in unintended ways, because they are very flexible. This is partially because the emulator itself can use the flexibility to optimize emulation, but also because hand-optimizing with SIMD instructions can significantly improve the performance of any application, which is necessary for the low-performance processors common in videogame consoles.
