std::execution is very interesting, but will be difficult to get started with, as cautioned by Sutter. This HPC Wire article demonstrates how to use standard C++ to benefit from asynchronously parallel computation on both CUDA and MPI:
https://www.hpcwire.com/2022/12/05/new-c-sender-library-enab...
Overlapping communication and computation has been a common technique for decades in high-performance computing to "hide latency", which leads to better scaling. Now standard C++ can be used to express parallel algorithms without tying to a specific scheduler.