Except for GPU manufacturers, who figured out the right way to do it 20 years ago.
20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-Nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.
Intel actually built and launched this 15 years ago with Xeon Phi (né Larrabee): a GPU-like barrel processor with tons of memory bandwidth and wide vector instructions that ran x86. In later versions they even addressed the critical weakness of GPUs for many use cases (poor I/O bandwidth).
It went nowhere, partly the usual Intel-doing-Intel-things, but mostly because most programmers struggle to write good code for that kind of architecture, so all of that potential was wasted.
That's because what GPUs do just isn't what CPUs do. GPUs don't have to deal with ad-hoc 10..100-char strings. They don't have to deal with 20-iteration loops whose iterations have necessarily-serial dependencies. They don't have to deal with parallelizing mutable hashmap probing operations.
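To put a face on the "necessarily-serial small loop" point, here's a minimal C sketch (the table layout and names are made up for illustration): linear probing into a small open-addressed hash table. Each probe only happens because the previous one missed, the trip count is tiny and data-dependent, and there's nothing for a wide machine to spread across lanes.

    #include <stdint.h>

    /* Hypothetical open-addressed hash table: 64 slots, linear probing,
       key 0 reserved as "empty".  Each probe depends on the previous one
       having missed, so the loop is inherently serial and usually exits
       after a handful of iterations. */
    #define SLOTS 64

    struct table {
        uint64_t keys[SLOTS];
        uint32_t vals[SLOTS];
    };

    static int lookup(const struct table *t, uint64_t key, uint32_t *out)
    {
        uint64_t h = key * 0x9E3779B97F4A7C15ull;       /* cheap multiplicative hash */
        for (unsigned i = 0; i < SLOTS; i++) {
            unsigned slot = (unsigned)((h + i) & (SLOTS - 1));
            if (t->keys[slot] == key) { *out = t->vals[slot]; return 1; }  /* hit */
            if (t->keys[slot] == 0) return 0;           /* empty slot: key is absent */
            /* otherwise keep probing: the next address only exists because this
               iteration failed, so there is no useful parallelism to extract */
        }
        return 0;
    }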
Indeed, what GPUs do is good for what GPUs do. But we have a tool for doing things that GPUs do well - it's called, uhh, what's it, uhh... oh yeah, GPUs. Copying that into CPUs is somewhere between completely unnecessary and directly harmful to the things CPUs are actually supposed to be used for.
The GPU approach has pretty big downsides for anything other than the most embarrassingly parallel code on very large inputs. Anything non-trivial (sorting, prefix sum) typically requires log(n) iterations and somewhere between twice as many and O(n*log(n)) memory accesses (even summing requires silly things like keeping the accumulator in memory instead of just using vector registers), compared to the CPU SIMD approach of doing a single pass with some shuffles. GPUs cope by trading off memory latency for more bandwidth, but any CPU that did that would go straight in the trash, because it would utterly kill scalar code performance.
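For contrast, here's what "a single pass with some shuffles" looks like on the CPU side, as a minimal SSE2 sketch (the function name and the assumption that n is a multiple of 4 are mine): the array is read and written exactly once, the in-register scan costs two shift+adds per vector, and the running total is carried in a register instead of bouncing through memory.

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */

    /* In-place inclusive prefix sum over int32, 4 lanes at a time.
       Assumes n is a multiple of 4 for brevity. */
    static void prefix_sum(int32_t *a, size_t n)
    {
        __m128i carry = _mm_setzero_si128();
        for (size_t i = 0; i < n; i += 4) {
            __m128i x = _mm_loadu_si128((__m128i *)(a + i));
            x = _mm_add_epi32(x, _mm_slli_si128(x, 4));    /* lane i += lane i-1 */
            x = _mm_add_epi32(x, _mm_slli_si128(x, 8));    /* lane i += lane i-2 */
            x = _mm_add_epi32(x, carry);                   /* add running total */
            carry = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3));  /* broadcast last lane */
            _mm_storeu_si128((__m128i *)(a + i), x);
        }
    }

That's the contrast the paragraph above draws: a constant amount of shuffle work per vector in one pass over memory, versus log(n) passes through memory on the GPU.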