No no no! The programming model "meets you where you are" in exactly the way that an auto-vectorizer does. You write unconstrained single-threaded code, the compiler tries to make it parallel, and if it fails your code still works, just slowly. The difference is a few abstractions and social-contract tweaks that make the auto-vectorizer reliable and easy to reason about. These tweaks "smell like" hacks, but CPU folks have spent 20 years trying to do better and their auto-vectorizers still fail at the basics, so it's past time to copy what works and move on.
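To make that concrete, here's a minimal sketch (my own toy example, assuming g++ or clang++ at -O3; the function names are made up). The first loop has independent iterations and a typical auto-vectorizer handles it; the second has a loop-carried dependence, so the compiler gives up and the same source simply runs serially, which is exactly the "still works, just slowly" fallback.

    #include <cstddef>

    // Independent iterations: the kind of loop an auto-vectorizer reliably turns into SIMD.
    void scale(float* __restrict out, const float* __restrict in, float k, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = k * in[i];        // no cross-iteration dependence
    }

    // Loop-carried dependence: the vectorizer gives up, but the code still compiles and runs,
    // just one element at a time.
    float running_sum(float* data, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            acc += data[i];            // each step needs the previous one
            data[i] = acc;
        }
        return acc;
    }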
I'm so glad someone else gets it. We don't want an auto-vectorizer; it doesn't work. Just give us a trivial way to vectorise the easy parts and leave the difficult parts to be difficult. We're better at the difficult stuff than your compiler is.
I think you maybe misunderstood what I was trying to say.
A model like CUDA only works well for the class of problems it was designed for. It requires HW built for those problems, a SW stack that can drive it, and problems that actually fit the paradigm. It does not work well for problems that aren’t embarrassingly parallel, where you process a little bit of data, make a decision, process a little more, and so on. As an example, go try to write a TCP stack in CUDA versus in a normal language to understand the inherent difficulty of such an approach.
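A rough sketch of the mismatch (assuming nvcc; the kernel and the toy state machine below are mine, not anyone’s real code). The first part is what CUDA is built for: millions of independent elements and no decisions. The second is the TCP-stack shape of problem: one logical thread of control that inspects a little data, updates state, and decides what to do next; a GPU buys you nothing there.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Fits the paradigm: every element is independent, so launch a thread per element.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Does not fit: "process a little, decide, process a little more".
    // A toy stand-in for the stateful, branchy control flow a TCP stack is made of.
    int protocol_step(const unsigned char* pkt, int len, int state) {
        for (int i = 0; i < len; ++i) {
            if (state == 0 && pkt[i] == 0x00)      state = 1;   // saw a header byte
            else if (state == 1 && pkt[i] == 0xFF) state = 2;   // handshake complete
            else if (state == 2)                   return state; // hand off to the next stage
        }
        return state;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);   // the easy, parallel part
        cudaDeviceSynchronize();
        printf("y[0] = %f\n", y[0]);                      // 5.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }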
And when I say “HW designed for this class of problems” I mean it. Why does the GPU have so much compute? Because it throws away HW blocks that modern CPUs devote to making “normal” code fast: speculative execution hardware, the machinery for cheap fine-grained thread synchronization, and so on.
It’s all tradeoffs, and there are no easy answers.