AVX2 is slower than SSE2-4.x under Windows ARM emulation

89 points • by vintagedave • today at 2:08 PM • 78 comments

Comments

> AVX2 level includes FMA (fast multiply-add)

FMA does not stand for fast multiply-add; it stands for fused multiply-add. Fused means the instruction computes the entire a * b + c expression using twice as many mantissa bits, and only then rounds the result to the precision of the arguments.

It might be that the Prism emulator failed to translate each FMA instruction into a pair of FMLA instructions (the equally fused ARM64 equivalent) and instead emulated the fused behaviour some other way, which in turn degraded the performance of the AVX2 emulation.
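
To make the fused vs. unfused distinction concrete, here is a minimal Python sketch. It models the single rounding of a fused multiply-add with exact rational arithmetic; the function names are made up for illustration, and this says nothing about how Prism or any hardware actually implements it.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    # Two roundings: a * b is rounded to a double, then the sum is rounded again.
    return a * b + c

def fused_multiply_add(a, b, c):
    # Model of a fused multiply-add: compute a * b + c exactly,
    # then round to a double only once at the very end.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Inputs chosen so the exact product 1 - 2**-54 does not fit in a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(unfused_multiply_add(a, b, c))  # the product rounds to 1.0, so the result is 0.0
print(fused_multiply_add(a, b, c))    # the tiny term survives: -2**-54
```

The two results differ, which is why an emulator cannot simply replace an FMA with a separate multiply and add without changing the program's output.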

kbolino • today at 2:36 PM

I suspected this was because the vector units were not wide enough, and it seems that is the case. AVX2 is 256-bit, ARM NEON is only 128-bit.

The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over a decade old.

TheJoeMan • today at 3:25 PM

I tried searching "SSE2-4.x" and this is the top result in DDG and Google, so I was initially confused about which instruction set the article is referring to. However, this appears to be shorthand for SSE2 through SSE4? Perhaps rephrasing the article title would be helpful.

Aissen • today at 2:33 PM

Spoiler is in the conclusion:

> Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.

mtklein • today at 4:31 PM

If I remember correctly, the AVX2 feature set is a fairly direct upscale of SSE4.1 to 256 bit. Very few instructions even allowed interaction between the top and bottom 128 bits, I assume to make implementation on existing 128 bit vector units easier. And the most notable new things that AVX2 added beyond that widening, fp16 conversion and FMA support, are also present in NEON, so I wouldn't expect that to be the issue either.

So I'd bet the issue is either newness of the codebase, as the article suggests, or perhaps that it is harder to schedule the work in 256 bit chunks than 128. It's got to be easier when you've got more than enough NEON q registers to handle the xmms, harder when you've got only exactly enough to pair up for handling ymms?
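
A rough Python model of the point above, as a sketch only: a lane-wise 256-bit operation that never crosses the 128-bit boundary can be done as two independent 128-bit operations, and the register budget is exactly used up when every ymm needs a pair of q registers. The helper names are hypothetical and this is not a description of what Prism actually does.

```python
LANES_PER_Q = 4        # four 32-bit lanes fit in one 128-bit register
MASK32 = 0xFFFFFFFF    # 32-bit integer adds wrap around

def add_q(a, b):
    # Model of one 128-bit lane-wise 32-bit integer add.
    assert len(a) == len(b) == LANES_PER_Q
    return [(x + y) & MASK32 for x, y in zip(a, b)]

def add_ymm_as_two_q(a, b):
    # Model of a 256-bit lane-wise add split into low and high 128-bit halves.
    assert len(a) == len(b) == 2 * LANES_PER_Q
    low = add_q(a[:LANES_PER_Q], b[:LANES_PER_Q])
    high = add_q(a[LANES_PER_Q:], b[LANES_PER_Q:])
    return low + high

# Register budget: x86-64 with AVX2 exposes 16 xmm/ymm registers,
# AArch64 exposes 32 128-bit SIMD registers.
X86_VECTOR_REGS = 16
ARM_Q_REGS = 32
spare_q_when_emulating_xmm = ARM_Q_REGS - X86_VECTOR_REGS * 1
spare_q_when_emulating_ymm = ARM_Q_REGS - X86_VECTOR_REGS * 2

print(add_ymm_as_two_q([1] * 8, [2] * 8))
print(spare_q_when_emulating_xmm, spare_q_when_emulating_ymm)
```

With xmm emulation there are registers left over for temporaries; with ymm emulation there are none, which would be one plausible source of extra spills and reloads.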

LeoNatan25 • today at 6:13 PM

Is there any equivalent look at Apple's Rosetta 2? Perhaps, if the author has the time and the hardware, they could take a similar look. Rosetta 2 is going away next year, and that's a shame, even if only for purely technical reasons. Apple will never open source it.

crest • today at 3:12 PM

I wouldn't be surprised if SSE4 were the fastest, because it's the easiest to map to NEON: both use 128-bit registers and offer a fairly similar feature set.

iberator • today at 2:27 PM

AVX2 should be banned anyway. Only expensive CPUs have it, which ruins minimum game requirements and makes hardware obsolete.

Most of the world lives on $300 per month.
