Hacker News

SIMD programming in pure Rust

92 points | by randomint64 | last Monday at 9:39 AM | 37 comments

Comments

pizlonator | today at 1:55 AM

This article references the fact that security issues in crypto libs are memory safety issues, and I think this is meant to be a motivator for writing the crypto using SIMD intrinsics.

This misses two key issues.

1. If you want to really trust that your crypto code has no timing side channels, then you've gotta write it in assembly. Otherwise, you're at the compiler's whims to turn code that seems like it really should be constant-time into code that isn't. There's no thorough mechanism in compilers like LLVM to prevent this from happening.

2. If you look at the CVEs in OpenSSL, they are generally in the C code, not the assembly code. If you look at OpenSSL CVEs going back to the beginning of 2023, there is not a single vulnerability in Linux/X86_64 assembly. There are some in the Windows port of the X86_64 assembly (because Windows has a different calling conv and the perlasm mishandled it). There are some on other arches. But almost all of the CVEs are in C, not asm.

If you want to know a lot more about how I think about this, see https://fil-c.org/constant_time_crypto

I do think it's a good idea to have crypto libraries implemented in memory safe languages, and that may mean writing them in Rust. But the actual kernels that do the cryptographic computations that involve secrets should be written in asm for maximum security so that you can be sure that sidechannels are avoided and because empirically, the memory safety bugs are not in that asm code.
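To illustrate the point about the compiler's whims: here is a minimal branchless select in Rust (a hypothetical helper, not from the article or OpenSSL). It looks constant-time at the source level, but nothing in the language or LLVM guarantees the mask arithmetic won't be rewritten into a branch, which is exactly why asm gives stronger guarantees.

```rust
// Branchless select: returns `a` when choice == 1, `b` when choice == 0.
// Appears constant-time in source, but an optimizer is free to turn the
// mask arithmetic back into a data-dependent branch -- there is no
// "constant-time" contract it must honor.
fn ct_select(choice: u8, a: u32, b: u32) -> u32 {
    let mask = (choice as u32).wrapping_neg(); // 0xFFFF_FFFF or 0x0
    (a & mask) | (b & !mask)
}

fn main() {
    assert_eq!(ct_select(1, 7, 9), 7);
    assert_eq!(ct_select(0, 7, 9), 9);
}
```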

shihab | yesterday at 11:30 PM

> For example, NEON ... can hold up to 32 128-bit vectors to perform your operations without having to touch the "slow" memory.

Something I recently learnt: the number of physical vector registers in modern x86 CPUs is significantly larger than the architectural count, even for 512-bit SIMD. Zen 5 CPUs actually have 384 physical vector registers: 384 × 512 bits = 24 KB!

rwaksmunski | last Monday at 2:40 PM

Every Rust SIMD article should mention the .chunks_exact() auto vectorization trick by law.
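For readers who haven't seen the trick: `chunks_exact` gives the optimizer fixed-width loop bodies it can map straight onto SIMD lanes, with the leftover elements handled separately via `remainder()`. A minimal sketch (the function is illustrative, not from the article):

```rust
// chunks_exact(8) yields only full 8-element chunks, so the inner loop has
// a known trip count and no bounds checks -- ideal for auto-vectorization.
fn sum_squares(data: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 8];
    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        for i in 0..8 {
            lanes[i] += chunk[i] * chunk[i];
        }
    }
    // The tail (fewer than 8 elements) is handed back explicitly.
    let mut total: f32 = lanes.iter().sum();
    for &x in chunks.remainder() {
        total += x * x;
    }
    total
}

fn main() {
    let v: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    assert_eq!(sum_squares(&v), 385.0); // 1^2 + 2^2 + ... + 10^2
}
```

Note that keeping eight separate lane accumulators matters for floats: LLVM won't reassociate FP additions on its own, so a single scalar accumulator would serialize the loop.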

dfajgljsldkjag | yesterday at 10:52 PM

The benchmarks on Zen 5 are absolutely insane for just a bit of extra work. I really hope the portable SIMD module stabilizes soon, so we do not have to keep rewriting the same logic for NEON and AVX every time we want to optimize something. That example about implementing ChaCha20 twice really hit home for me.

vintagedave | today at 9:30 AM

I was surprised not to see Sleef mentioned as an option. It's available for Rust and handles the architecture-agnostic/portable needs they have.

https://docs.rs/sleef/latest/sleef/

crote | last Monday at 11:55 AM

What is the "nasty surprise" of Zen 4 AVX512? Sure, it's not quite the twice as fast you might initially assume, but (unlike Intel's downclocking) it's still a strict upgrade over AVX2, is it not?

jeffbee | yesterday at 11:00 PM

"Intel CPUs were downclocking their frequency when using AVX-512 instructions due to excessive energy usage (and thus heat generation) which led to performance worse than when not using AVX-512 acceleration."

This is an overstatement so gross that it can be considered false. On Skylake-X, for mixed workloads that only had a few AVX-512 instructions, a net performance loss could have happened. On Ice Lake and later this statement was not true in any way. For code like ChaCha20 it was not true even on Skylake-X.

wyldfire | today at 1:21 AM

Does Rust provide architecture-specific intrinsics like C/C++ toolchains usually do? That's a popular way to do SIMD.
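(It does: stable Rust exposes vendor intrinsics under `core::arch`, gated behind `unsafe`. A minimal sketch using the SSE2 baseline of x86_64 — the intrinsic names are the real `core::arch::x86_64` ones, mirroring Intel's C intrinsics:)

```rust
#[cfg(target_arch = "x86_64")]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::x86_64::*;
    // SSE/SSE2 are part of the x86_64 baseline, so no runtime
    // feature detection is needed for these particular intrinsics.
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
        out
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    assert_eq!(
        add4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]),
        [6.0, 8.0, 10.0, 12.0]
    );
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

For non-baseline features like AVX2 you would additionally mark the function `#[target_feature(enable = "avx2")]` and guard calls with `is_x86_feature_detected!("avx2")`.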

nice_byte | today at 3:04 AM

it's hard to believe that using simd extensions in rust is still as much of a chaotic clusterfudge as it was the first time I looked into it. no support in standard library? 3 different crates? might as well write inline assembly...

formerly_proven | yesterday at 10:20 PM

Lazy man's "kinda good enough for some cases" SIMD in pure Rust is to simply target x86-64-v3 (RUSTFLAGS=-Ctarget-cpu=x86-64-v3), which is supported by all AMD Zen parts and by Intel CPUs since Haswell. For floating-point code, which cannot be auto-vectorized due to the accuracy implications of reassociation, "simply" write it with explicit four- or eight-way lanes, and LLVM will do the rest. Usually. Loops may need explicit handling of the head or tail to auto-vectorize (chunks_exact helps with this: it hands you the tail).
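For reference, the build invocation looks like this (crate setup is illustrative; this is a config fragment, not project-specific):

```shell
# Compile the whole crate for the x86-64-v3 baseline (AVX2, FMA, BMI1/2):
# safe on Haswell (2013) or newer Intel and on every AMD Zen generation.
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release

# Inspect which target features a given CPU level turns on:
rustc -C target-cpu=x86-64-v3 --print cfg | grep target_feature
```

The trade-off is that the resulting binary will crash with an illegal-instruction fault on older CPUs, so this flag suits binaries you build for known hardware rather than ones you distribute widely.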