Sorry, I wasn't aware of these developments (having abandoned CUDA for hardware-agnostic solutions before 2020). In any case, it doesn't change my point if the solution is specific to a single vendor.
I'm extremely dubious that such an opaque abstraction can actually solve the (true) problem. "Not having to write CUDA" is not enough - how do you tune performance? Parallelization strategies, memory prefetching, data arrangement in on-chip caches, when to fuse kernels and when not to... I don't doubt the compiler can do these things, but I do doubt it can know at compile time which kernel transformations will maximize performance on any given piece of hardware. That's the real problem: achieving an abstraction that still gives one enough control to reach peak performance.
Edit: correct me if I'm wrong, but it seems that std::par can't even use shared memory, let alone let one control its usage? If so, then my point stands: C++ is not remotely relevant here. Again, avoiding writing CUDA (etc.) doesn't solve the real problem that high-performance language abstractions aim to address.
So what would be such an HPC language that you're so fond of? A quick web search reveals only languages that use C++/CUDA code as a back end (Python), languages that are new and experimental (Julia), or Fortran. For what you're describing, none of these seems all that good, so you've piqued my curiosity.