Hacker News

mtklein today at 4:31 PM | 2 replies

If I remember correctly, the AVX2 feature set is a fairly direct upscale of SSE4.1 to 256 bits. Very few instructions even allowed interaction between the top and bottom 128 bits, I assume to make implementation on existing 128-bit vector units easier. And the most notable new things that AVX2 added beyond that widening, fp16 conversion and FMA support, are also present in NEON, so I wouldn't expect that to be the issue either.

So I'd bet the issue is either the newness of the codebase, as the article suggests, or perhaps that it is harder to schedule the work in 256-bit chunks than in 128-bit ones. It's got to be easier when you've got more than enough NEON q registers to handle the xmms, harder when you've got only exactly enough to pair up for handling the ymms?
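
For concreteness, here's a minimal sketch of that pairing in C with NEON intrinsics; the ymm_f32 type and emu_vaddps name are invented for illustration, not taken from any real translator. AVX2's 16 ymm registers would need 32 q registers, which is exactly the 32 vector registers AArch64 provides, whereas 16 xmm registers map onto 16 q registers with 16 left over:

    #include <arm_neon.h>

    /* Hypothetical model of one 256-bit ymm register as two NEON q registers. */
    typedef struct {
        float32x4_t lo;   /* bits   0..127 of the emulated ymm */
        float32x4_t hi;   /* bits 128..255 of the emulated ymm */
    } ymm_f32;

    /* Emulate VADDPS ymm, ymm, ymm: lane-wise math never crosses the 128-bit
       boundary, so each half maps to a single NEON instruction. */
    static inline ymm_f32 emu_vaddps(ymm_f32 a, ymm_f32 b) {
        ymm_f32 r;
        r.lo = vaddq_f32(a.lo, b.lo);
        r.hi = vaddq_f32(a.hi, b.hi);
        return r;
    }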


Replies

spacecadet_ today at 5:00 PM

> Very few instructions even allowed interaction between the top and bottom 128 bits

That would be plain AVX; AVX2 has shuffles across the 128-bit boundary. To me that seems like the main hurdle for emulation with 128-bit vectors: in my experience, compilers are very eager to emit shuffle instructions if allowed, and emulating a 256-bit shuffle with 128-bit operations would require two shuffles and a blend for each half of the emulated register.
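
To make that cost concrete, here's a hedged sketch in C/NEON of emulating one 256-bit cross-lane byte shuffle (something in the spirit of AVX2's VPERM* instructions) out of 128-bit operations; the ymm_u8 type and function name are invented for illustration:

    #include <arm_neon.h>

    typedef struct { uint8x16_t lo, hi; } ymm_u8;  /* hypothetical 256-bit model */

    /* idx holds 32 byte indices in 0..31 selecting from the 32 bytes of src.
       Each emulated half needs two table lookups plus a combine; TBL returns
       zero for out-of-range indices, so an OR stands in for the blend. */
    static inline ymm_u8 emu_shuffle256(ymm_u8 src, ymm_u8 idx) {
        uint8x16_t bias = vdupq_n_u8(16);
        ymm_u8 r;
        r.lo = vorrq_u8(vqtbl1q_u8(src.lo, idx.lo),
                        vqtbl1q_u8(src.hi, vsubq_u8(idx.lo, bias)));
        r.hi = vorrq_u8(vqtbl1q_u8(src.lo, idx.hi),
                        vqtbl1q_u8(src.hi, vsubq_u8(idx.hi, bias)));
        return r;
    }

(AArch64's TBL can also take a two-register, 32-byte table via vqtbl2q_u8, which would cut this to one lookup per half, but a generic translation of 128-bit ops would look more like the above.)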

EDIT: I just noticed that the benchmark in the article is pure math, which probably wouldn't hit this particular issue, so this doesn't explain the performance difference...

ack_complete today at 4:52 PM

There are also mode switching and calling convention issues.

The way that the vector registers were extended to 256 bits causes problems when legacy 128-bit and 256-bit ops are mixed. Doing so puts the CPU into a mode where all legacy 128-bit ops are forced to blend in the high half, which can cut the throughput of existing SSE2-based library routines to as little as a quarter. For this reason, AVX code has to aggressively use the VZEROUPPER instruction to ensure that the CPU is not left in the AVX 256-bit state before returning to, or calling into, any library or external code that uses SSE2. VZEROUPPER sets a flag to zero the high half of all 256-bit registers, so it's cheap on modern x86 CPUs but can be expensive to emulate without hardware support.
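
At the source level the discipline looks roughly like this (a minimal sketch using the standard _mm256_zeroupper() intrinsic; some_sse2_library_call is a hypothetical stand-in for external SSE2 code, and compilers targeting AVX normally insert vzeroupper at boundaries like this automatically):

    #include <immintrin.h>
    #include <stddef.h>

    void some_sse2_library_call(const float *p);  /* hypothetical legacy SSE2 code */

    void scale_then_call(float *dst, const float *src, float s, size_t n) {
        __m256 vs = _mm256_set1_ps(s);
        for (size_t i = 0; i + 8 <= n; i += 8)
            _mm256_storeu_ps(dst + i,
                             _mm256_mul_ps(_mm256_loadu_ps(src + i), vs));

        /* Clear the upper 128 bits of every ymm before leaving AVX code so the
           callee's legacy SSE2 instructions don't pay the mixed-state penalty. */
        _mm256_zeroupper();
        some_sse2_library_call(dst);
    }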

The other problem is that only the low 128 bits of the vector registers are preserved across function calls, due to the Windows x64 calling convention and the VZEROUPPER issue. This means that practically any call to external code forces the compiler to spill any live AVX vectors to memory. Ideally 256-bit vector usage is concentrated in leaf routines so this isn't an issue, but where it shows up in non-leaf routines, it can result in a lot of memory traffic.
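
A sketch of the non-leaf case (log_progress is a hypothetical external, non-vector function): because the upper halves of the ymm registers are not preserved across the call, the compiler has to spill and reload the accumulator around every iteration's call.

    #include <immintrin.h>
    #include <stddef.h>

    void log_progress(size_t i);  /* hypothetical external code */

    float sum_with_logging(const float *p, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i + 8 <= n; i += 8) {
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
            log_progress(i);  /* forces acc (a live ymm) to memory and back */
        }
        /* Horizontal sum of the 8 lanes. */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

Hoisting the vector loop into a leaf helper that makes no calls lets the accumulator stay in a register for the whole loop.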