If you think C++ is the best option here, then I don't think you've actually worked in this space or appreciated the problems these languages try to solve, in particular because you can't program accelerators with C++.
Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (which must be handled manually in performant GPU code, for instance) using optimized memory tiling and other strategies. Kernel fusion is another nontrivial example: it improves effective bandwidth by avoiding round trips to memory for intermediate results.
Add to that the diversity of hardware one needs to target (both within and among vendors), i.e., portability not just of function but of performance, and the need for better tooling becomes abundantly obvious. C++ isn't even an entrant in this space.
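To make the tiling and fusion points concrete, here is a minimal CPU-side sketch in plain C++. It is only an illustration under my own assumptions: the function names (`scale_then_add_unfused`, `scale_then_add_fused`, `transpose_tiled`) and the default tile size of 32 are hypothetical, and real GPU code would stage tiles in shared memory rather than rely on a CPU cache. The idea is the same, though: fusion removes the intermediate buffer and its memory traffic, and tiling keeps each working set small enough to stay in fast memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two separate passes ("kernels"). The intermediate vector `tmp`
// is written to memory by the first pass and read back by the second.
std::vector<float> scale_then_add_unfused(const std::vector<float>& x,
                                          const std::vector<float>& y,
                                          float a) {
    assert(x.size() == y.size());
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = a * x[i];       // pass 1
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = tmp[i] + y[i];  // pass 2
    return out;
}

// Fused: one pass, no intermediate buffer, so each input element is read
// once and each output element is written once.
std::vector<float> scale_then_add_fused(const std::vector<float>& x,
                                        const std::vector<float>& y,
                                        float a) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
    return out;
}

// Tiled transpose of a row-major rows x cols matrix. Working on
// tile x tile blocks keeps both the rows being read and the columns being
// written within a small region of memory at a time.
std::vector<float> transpose_tiled(const std::vector<float>& in,
                                   std::size_t rows, std::size_t cols,
                                   std::size_t tile = 32) {
    assert(in.size() == rows * cols);
    assert(tile > 0);
    std::vector<float> out(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            const std::size_t r1 = std::min(r0 + tile, rows);  // clamp edge tiles
            const std::size_t c1 = std::min(c0 + tile, cols);
            for (std::size_t r = r0; r < r1; ++r)
                for (std::size_t c = c0; c < c1; ++c)
                    out[c * rows + r] = in[r * cols + c];
        }
    }
    return out;
}
```

The performance-portability problem is that the best tile size, and whether a given fusion pays off at all, depends on the target hardware, so choices like the hard-coded 32 above are exactly what has to be retuned per device.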
Wait, what!? I have been programming CUDA since 2009 and specifically remember C++ being pushed as its main development language in the first few years, after a brief "CUDA C extension" period.
What?!?
NVIDIA designs CUDA hardware specifically for the C++ memory model. They went to the trouble of reworking their original hardware over several years so that all new cards would follow this model, even though PTX was designed as a polyglot target.
Additionally, ISO C++ proposals like senders/receivers are driven by NVIDIA employees working on CUDA.