re edit - While the hardware differences are significant (some by necessity, some by tradition), that's not my point.
My specific point is that "how to write code that is generic across block size, boundary conditions, and thread divergence" is just not the right question to ask for many CPU SIMD use-cases. Many of those Just Do Not Fit That Paradigm. If you think you can simply squeeze CPU SIMD usage into that box, then I don't think you've actually done any CPU SIMD beyond very trivial things (see my example problem in the other thread).
You want to take advantage of the block size on CPUs; it's a shame GPU programmers don't get to. Elsewhere I've seen multiple GPU programmers annoyed at not being able to use the CPU SIMD paradigm of explicit registers on GPUs. And doing anything about thread divergence on CPUs is just not going to go well, given the necessary focus on high clock rates (which makes branch mispredictions comparatively very expensive).
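To be concrete about what "explicit registers / block size" means here, a minimal sketch (AVX2, a made-up array sum; the function name and the two-accumulator choice are just for illustration, not anything from the thread):

```c
#include <immintrin.h>
#include <stddef.h>

/* The programmer explicitly picks the block size (8 floats per register)
   and keeps accumulators in named registers, instead of writing
   per-"thread" scalar code and hoping it maps onto the hardware well. */
static float sum_f32(const float *x, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();  /* second accumulator to hide add latency */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(x + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(x + i + 8));
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    /* horizontal reduction of the 8 lanes */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    float total = _mm_cvtss_f32(s);
    for (; i < n; i++) total += x[i];   /* scalar tail for the remainder */
    return total;
}
```

The point isn't the reduction itself, it's that the block width, the number of in-flight registers, and the tail handling are all decisions you get to make, and they matter for performance.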
You of course don't need anything fancy if you have a purely embarrassingly-parallel problem, which is what GPUs are explicitly made for. But for those, autovectorization does actually already work, given hardware that has the necessary instructions (memory gathers/scatters, masked loads/stores where needed; and of course no programming paradigm would magically make it work on hardware that doesn't). At worst you may need to add a _Pragma to tell the compiler to ignore memory aliasing, at which point the loop body is exactly the same programming paradigm as CUDA (with thread synchronization being roughly "} for (...) {", except you gain better control over how things happen).
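Roughly what that looks like, as a minimal sketch (a saxpy-ish body invented for illustration; GCC's ivdep pragma standing in for the aliasing hint, with omp simd / clang's loop pragma as alternatives):

```c
#include <stddef.h>

/* CUDA-like shape on a CPU: the loop body is per-element code, the pragma
   (plus restrict) tells the compiler not to worry about aliasing, and the
   "__syncthreads()" equivalent is just the boundary between two loops. */
void step(float *restrict out, const float *restrict a,
          const float *restrict b, float k, size_t n) {
    #pragma GCC ivdep   /* or: #pragma omp simd */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];   /* same shape as a kernel body, i = "thread" index */

    /* synchronization point: "} for (...) {" */
    #pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        out[i] = out[i] * out[i];
}
```

In macro form that's _Pragma("GCC ivdep"), which is all the "paradigm" the embarrassingly-parallel case needs.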