It’s such a ridiculous situation we’re in. Just about every consumer CPU of the past 20 years packs an extra order of magnitude or two of punch for data processing workloads, but to keep it from going to waste you have to resort to writing your inner loops with low-level, nonportable intrinsics that are just a step above assembly. Or pray that the gods of autovectorization are on your side.
Evidence points towards the autovectorization gods being dead, false, or too weak. I hear their prophets, but I don't believe them.
Well, yeah. You need to describe your data flow in a way the CPU can take advantage of. Compilers aren't magic.
Adding parallelism is much easier on the hardware side than on the software side. We've more or less figured out the easy cases, such as independent tasks with limited interaction, and made them practical for the average developer. But nobody understands the harder cases (such as SIMD) well enough to create useful abstractions that don't constrain hardware development too much.