This reads like you haven't tried CUDA. The whole point of CUDA is that your CUDA code has single-threaded semantics. The problem you assumed it has is the problem it doesn't have, and the fact that it doesn't have this problem is the reason why it works and the reason why it wins.
EDIT: ok, now you're talking about the hardware difference between CPUs and GPUs. This is relevant for the types of programs each can accelerate -- barrel processors are uniquely suited to embarrassingly parallel problems, obviously -- but it is not relevant to the question of "how to write code that is generic across block size, boundary conditions, and thread divergence." CUDA figured this out, non-embarrassingly-parallel programs still have this problem, and they should copy what works. The best time to copy what works was 20 years ago; the second best time is today.
re edit - While the hardware differences are significant -- some by necessity, some by tradition -- they're not my point.
My specific point is that "how to write code that is generic across block size, boundary conditions, and thread divergence" is just not the right question to ask for many CPU SIMD use-cases. Many of those Just Do Not Fit That Paradigm. If you think you can just squeeze CPU SIMD usage into that box, then I don't think you've actually done any CPU SIMD beyond very trivial things (see my example problem in the other thread).
You want to take advantage of block size on CPUs; it's sad that GPU programmers don't get to. Elsewhere I've seen multiple GPU programmers annoyed at not being able to use the CPU SIMD paradigm of explicit registers on GPUs. And doing anything about thread divergence on CPUs is just not gonna go well, given the necessary focus on high clock rates (which makes branch mispredictions relatively ridiculously expensive).
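To make "explicit registers" concrete: a minimal sketch of the CPU SIMD paradigm, assuming an x86-64 target with baseline SSE2 intrinsics (the function name `sum_sse2` is mine, for illustration). The accumulator lives in one named vector register across the whole loop -- the kind of register-level control GPU programming models don't expose:

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* Sum n int32s, four lanes at a time. `acc` is an explicit
   128-bit register the programmer keeps and names across
   iterations, then reduces horizontally at the end. */
int32_t sum_sse2(const int32_t *a, size_t n) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(a + i)));
    /* horizontal reduction of the register's 4 lanes */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    int32_t s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)  /* scalar tail for leftover elements */
        s += a[i];
    return s;
}
```

Note that both the lane count (4) and the reduction step are visible to, and chosen by, the programmer -- exactly the "take advantage of block size" knob the comment is talking about.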
You of course don't need anything fancy if you have a purely embarrassingly-parallel problem, which is exactly what GPUs are made for. But for these, autovectorization does actually already work, given hardware that has the necessary instructions (memory gathers/scatters, and masked loads/stores if necessary; of course no programming paradigm would magically make it work on hardware that doesn't). At worst you may need to add a _Pragma to tell the compiler to ignore memory aliasing, at which point the loop body is exactly the same programming paradigm as CUDA (with thread synchronization being roughly "} for (...) {", but you gain better control over how things happen).
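A minimal sketch of that shape, assuming GCC's `#pragma GCC ivdep` as the "ignore aliasing" hint (other compilers spell it differently, e.g. `#pragma clang loop vectorize(assume_safety)`); the function and names are made up for illustration. Each loop body is a per-element "kernel", and the seam between the two loops is the "} for (...) {" barrier:

```c
#include <stddef.h>

/* Two embarrassingly-parallel passes over an array. Each loop
   body reads like a CUDA kernel body for one element; the pragma
   tells the compiler to assume no aliasing so it can vectorize. */
void scale_then_shift(float *dst, const float *src,
                      float s, float b, size_t n) {
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)  /* "kernel" 1 */
        dst[i] = src[i] * s;
    /* implicit barrier: all of pass 1 completes before pass 2 */
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)  /* "kernel" 2 */
        dst[i] = dst[i] + b;
}
```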
It has single-threaded semantics per element. That's fine for anything that does completely independent computation for each element, but quite annoying for everything else, requiring major algorithmic changes. And CPU SIMD is used for a lot of such things.
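One standard example of "everything else" (my choice of illustration, not from the thread): an inclusive prefix sum. The loop-carried dependency means there is no independent per-element body to hand to a CUDA-style model; making it parallel requires a different algorithm (a log-step scan), not just a pragma:

```c
#include <stddef.h>

/* Inclusive prefix sum: out[i] = a[0] + ... + a[i].
   `run` depends on the previous iteration, so the body is not
   an independent per-element computation -- exactly the case
   where per-element single-threaded semantics stop helping. */
void prefix_sum(int *out, const int *a, size_t n) {
    int run = 0;
    for (size_t i = 0; i < n; i++) {
        run += a[i];   /* loop-carried dependency */
        out[i] = run;
    }
}
```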