Hacker News

Why we need SIMD

121 points | by atan2 | last Monday at 3:10 AM | 38 comments

Comments

Panzerschreck today at 6:03 AM

The main problem with SIMD instructions is that regular code doesn't use them. Almost always, someone needs to write SIMD code manually to achieve good performance, which is rarely done, and then only in some tight loops and niche cases. Cryptography-related code in a browser may be SIMD-based, for example, but other code uses almost no SIMD.

Modern compilers can sometimes vectorize regular code, but only occasionally, since they often can't prove that read/write operations will access valid memory regions. So you still need to write your code in such a way that the compiler can vectorize it, but that approach isn't reliable, and it's better to use SIMD instructions directly to be sure.

Remnant44 yesterday at 7:43 PM

I'm just happy that finally, with the popularity of Zen 4 and Zen 5 chips, AVX-512 is around 20% of the running hardware in the Steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but it's going in the right direction.

Compared to the weird, lumpy Lego set of AVX1/2, AVX-512 is quite enjoyable to write with, and it still has some fun instructions that deliver more than just twice the width.

Personal example: the double-width byte shuffle (_mm512_permutex2var_epi8), which takes 128 bytes of input in two registers. I had a critical inner loop that uses a 256-byte lookup table; running an upper/lower double shuffle and blending the results essentially pops out 64 answers per cycle from the lookup table on Zen 5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.

lordnacho yesterday at 7:12 PM

When I optimize stuff, I just think of the SIMD instructions as a long sandwich toaster. You can have a normal toaster that makes one sandwich, or you can have a 4x toaster that makes 4 sandwiches at once. If you have a bunch of sandwiches to make, obviously you want to align your work so that you can do 4 at a time.

If you want to make 4 at a time though, you have to keep the thing fed. You need your ingredients in the cache, or you are just going to waste time finding them.

jasonthorsness yesterday at 7:46 PM

Compared to GPU programming, the gains from SIMD are limited, but it's a small-multiple boost that's available pretty much everywhere. C# makes it easy to use through its Vector classes. WASM SIMD still has a way to go, but even with the current 128-bit width you can see dramatic improvements in some buffer-processing cases (I did a little comparison demo here showing a 20x improvement in bitwise complement of a large buffer: https://www.jasonthorsness.com/2)

p0nce yesterday at 8:53 PM

4 lanes of SIMD (like in, say, SSE) is not necessarily 4x faster because of memory access; sometimes it's better than that (and often it's less).

PSHUFB wins in case of unpredictable access patterns. Though I don't remember how much it typically wins.

PMOVMSKB can replace several conditionals (up to 16 in SSE2 for byte operands) with a single one, winning in terms of branch prediction.

PMADDWD is in SSE2 and does 8 word multiplies, not 4. SSE4.1 has FP rounding that doesn't require changing the rounding mode, etc. SSE4.2 has the weird string functions. There are non-temporal moves and prefetching in some cases.

The cool thing with SIMD is that it puts a lot less stress on the CPU's memory-access prediction and branch prediction, not only the ALU. So when you optimize, it can help unrelated parts of your code go faster.

dpifke yesterday at 10:50 PM

Related: Go is looking to add SIMD intrinsics, which should provide a more elegant way to use SIMD instructions from Go code: https://go.dev/issue/73787

chasil yesterday at 8:40 PM

The author has neglected the 3DNow! SIMD instructions from AMD.

They were notable for several reasons, although they are no longer included in modern silicon.

https://en.wikipedia.org/wiki/3DNow!

vardump yesterday at 7:13 PM

Wider SIMD would be useful, especially with AVX-512-style improvements: 1024-bit or even 2048-bit wide operations.

Of course, memory bandwidth would have to increase proportionally, otherwise the cores might have no data to process.

dang yesterday at 7:35 PM

Recent and related:

Why do we even need SIMD instructions? - https://news.ycombinator.com/item?id=44850991 - Aug 2025 (8 comments)

kristianp yesterday at 8:53 PM

No mention of branches, which is a complementary concept. If you unroll your loop, you can get part of the way to SIMD performance by keeping the CPU pipeline filled.

aboardRat4 today at 3:48 AM

Why does such an abbreviation still exist in 2025?

They have been in CPUs for so long that I expected them to be inseparable, to the degree that people wouldn't even remember they were once a separate thing.