I suspected this was because the vector units were not wide enough, and it seems that is the case. AVX2 is 256-bit, ARM NEON is only 128-bit.
The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over 15 years old.
Wider SIMD is a solution in search of a problem in most cases.
If your code can go wide and has few branches (uses SIMD basically every cycle), either a GPU or matrix co-processor will handily beat the performance of several CPU cores all running together.
If your code can go wide, but is branchy (uses bursts of SIMD between branches), wider becomes even less worth it. If it takes 4 cycles to put through a 256-bit SIMD instruction and you have some branches between the next one, using a 128-bit SIMD with 2 instructions will either have them execute in parallel at the same 4 cycles or even in the worst case, they will pipeline to 5 cycles (that's just a single instruction bubble in the FPU pipeline).
You can increase this differential by going to a 512-bit pipeline, but if it's just occasional 512-bit, you can still match with 4 SIMD units (The latest couple of ARM cores have 6 SIMD units) and while pipelining out from 4 to 7 cycles means you need at least 3-cycle bubbles to break even, this still doesn't seem too unusual.
The one area where this seems to be potentially untrue is simulations working with loads of f64 numbers which can consistently achieve high density with code just branchy enough to make GPUs be inefficient. Most of these workloads are running on supercomputers though and the ARM competitor here is the Fujitsu A64FX which does have 512-bit SVE.
It's also worth noting that even modern x86 chips (by both AMD and Intel) seem to throttle under heavy 512-bit multi-core workloads. Reducing the clockspeed in turn reduces the integer performance which may make applications slower in some cases
All of this is why ARM/Qualcomm/Apple's chips with 128-bit SIMD and a couple AMX/SME units are very competitive in most workloads even though they seem significantly worse on paper.
SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side. RISC-V did the same thing with RVV, for better or worse.
ARM favored wider ILP and mostly symmetric ALUs, while x86 favored wider and asymmetric ALUs
Most high-end ARM cores were 4x128b FMA, and Cortex-X925 goes to 6x128b FMA. Contrast that to Intel that was 2x256b FMA for the longest, then 2x512b FMA, with another 1-2 pipelines that can't do FMA.
But ultimately, 4x128b ≈ 2x256b, and 2x256b < 6x128b < 2x512b in throughput. Permute is a different factor though, if your algorithm cares about it.
Well, you can always use a Fujitsu A64FX...let me check eBay.. :-)
AVX2 isn't really 256-bit. Its 2x128-bit.
Hasn't there been issues with AVX2 causing such a heavy load on the CPU that frequency scaling would kick in a lot of cases slowing down the whole CPU?
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
> The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support?
Very wide SIMD instructions require a lot of die space and a lot of power.
The AVX-512 implementation in Intel's Knight's Landing took up 40% of the die area (Source https://chipsandcheese.com/p/knights-landing-atom-with-avx-5... which is an excellent site for architectural analysis)
Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.
Most ARM server parts are designed to have very high core counts, which requires small individual die sizes. Adding very wide SIMD support would grow die space of individual cores a lot and reduce the number that could go into a single package.
Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.
Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores as a tradeoff to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases of high throughput SIMD code are just moving to GPUs anyway.