anonymoushn • yesterday at 1:59 PM

It's really not. As an example, for string processing tasks (including codecs, which various server software spends a significant percentage of its runtime on), NEON includes a deinterleaving load into 4 registers and byte-wise shuffles that accept 2, 3, or 4 registers' worth of lookup table. These primitives are quite different from those available on AVX2 or AVX-512, and the fact that they are available and cheap to use means you end up with somewhat different algorithms for the two types of targets. Even the practice of using the toys available in AVX2 well for this sort of task is somewhat obscure. Folks who have worked on codec-type stuff but primarily used AVX-512 often have trouble figuring out how to do most of the same things in similar instruction counts if masked versions of the instructions are not available.
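
For readers who have not used the two NEON primitives mentioned above, here is a rough scalar model of what they compute, written in Python. This is not NEON code and says nothing about performance; the function names `ld4_deinterleave` and `tbl_lookup` are made up for illustration, and the base64-style alphabet is just an assumed example of a 64-entry table, not something from the comment.

```python
def ld4_deinterleave(buf, lanes=16):
    """Scalar model of a 4-way deinterleaving byte load (in the spirit of NEON LD4).

    Reads 4 * lanes bytes and returns four "registers" of `lanes` bytes each.
    Register k receives input bytes k, k+4, k+8, ...
    """
    assert len(buf) == 4 * lanes
    return [bytes(buf[k::4]) for k in range(4)]


def tbl_lookup(table_regs, indices):
    """Scalar model of a byte-wise table lookup (in the spirit of NEON TBL).

    table_regs: 1 to 4 sixteen-byte "registers", concatenated into one table.
    Each index byte selects one table byte; out-of-range indices yield 0.
    """
    table = b"".join(table_regs)
    return bytes(table[i] if i < len(table) else 0 for i in indices)


# 64 interleaved bytes (think 16 RGBA pixels) split into four channel registers.
pixels = bytes(range(64))
r, g, b, a = ld4_deinterleave(pixels)

# A 64-entry table spread over four 16-byte registers (illustrative alphabet).
alphabet = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
regs = [alphabet[i:i + 16] for i in range(0, 64, 16)]
encoded = tbl_lookup(regs, bytes([0, 25, 26, 51, 52, 61, 62, 63, 64, 255]))
```

The point of the model is that a whole 64-entry table fits in one lookup, and that channel separation happens as part of the load. On a target without these operations, both steps have to be assembled from several narrower shuffles, which is why the algorithms tend to diverge.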

Replies

janwas • yesterday at 2:23 PM

I made the same argument a while ago but a coworker changed my mind.

Can you afford to write and maintain a codepath per ISA (knowing that more keep coming, including RVV, LASX and HVX), to squeeze out the last X%? Is there no higher-impact use of developer time? If so, great.

If not, what's the alternative - scalar code? I'd think decent portable SIMD code is still better than nothing, and nothing (scalar) is all we have for new ISAs which have not yet been hand-optimized. So it seems we should anyway have a generic SIMD path, in addition to any hand-optimized specializations.
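
The structure described here (a generic path that always exists, plus optional hand-tuned specializations) can be sketched as a lookup with a fallback. This is only a toy model in Python with hypothetical names (`sum_bytes_generic`, `sum_bytes_neon`, `select_impl`); it is not how Highway or any particular library does dispatch, and both functions are plain Python stand-ins rather than real SIMD code.

```python
def sum_bytes_generic(data):
    """Stand-in for the portable SIMD path: works for every target."""
    return sum(data) & 0xFFFFFFFF


def sum_bytes_neon(data):
    """Stand-in for a hand-optimized specialization for one ISA.

    It must return the same result as the generic path.
    """
    return sum(data) & 0xFFFFFFFF


# Only ISAs that someone has hand-optimized appear here (hypothetical table).
SPECIALIZED = {"neon": sum_bytes_neon}


def select_impl(isa):
    """Pick the specialization if one exists, otherwise the generic path."""
    return SPECIALIZED.get(isa, sum_bytes_generic)


# A new ISA with no hand-written code still gets a working implementation.
impl = select_impl("rvv")
result = impl(b"\x01\x02\x03")
```

A new target costs nothing to support under this arrangement, and a specialization can be added later for the kernels where the extra maintenance pays off.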

BTW, Highway indeed provides decent emulations of LD2..4, and at least 2-table lookups. Note that some Arm uarchs are anyway slow with 3 and 4.
