SIMD performance in modern Intel and AMD CPUs is so bad that it is useless outside very specific circumstances.
This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls the pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.
The shared resources often include the floating-point registers and compute units, so it's a double whammy.
Yet it is still faster than not vectorizing at all, or than calling into the GPU, on workloads where bus traffic would take up the majority of the execution time.
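To make the bus-traffic point concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function names are made up for illustration, it assumes the result copied back is the same size as the input, and every number in it is hypothetical rather than a measurement.

```python
def gpu_total_seconds(payload_bytes, bus_bytes_per_s, gpu_compute_s):
    """Estimated cost of offloading: copy the input over the bus,
    compute on the GPU, then copy a same-sized result back (assumption)."""
    transfer_s = 2 * payload_bytes / bus_bytes_per_s
    return transfer_s + gpu_compute_s


def simd_beats_gpu(payload_bytes, bus_bytes_per_s, gpu_compute_s, simd_compute_s):
    """True when staying on the CPU with SIMD finishes before the GPU round trip."""
    return simd_compute_s < gpu_total_seconds(
        payload_bytes, bus_bytes_per_s, gpu_compute_s
    )


# Hypothetical figures, chosen only to illustrate the trade-off.
payload = 64 * 1024 * 1024   # 64 MiB of input data
bus = 16e9                   # assumed 16 GB/s of usable bus bandwidth
gpu_compute = 0.001          # assumed 1 ms of GPU kernel time
simd_compute = 0.006         # assumed 6 ms for the SIMD version on the CPU

if __name__ == "__main__":
    print("GPU round trip (s):", gpu_total_seconds(payload, bus, gpu_compute))
    print("SIMD wins:", simd_beats_gpu(payload, bus, gpu_compute, simd_compute))
```

With these made-up numbers the GPU kernel itself is six times faster, but the two bus copies cost more than the whole SIMD run, so the CPU version still comes out ahead. Shrink the payload and the conclusion flips.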