> C++ is really good at optimizing for memory bandwidth
In general, most modern CPU thread-safe code is still a bodge in most languages. If folks are unfortunate enough to encounter inseparable overlapping state sub-problems, than there is no magic pixie dust to escape the computational cost. On average, attempting to parallelize this type of code can end up >30% slower on identical hardware, and a GPU memory copy exchange can make it even worse.
Sometimes even compared to a large multi-core CPU, a pinned-core higher clock-speed chip will win out for those types of problems.
Thus, the mystery why most people revert to batching k copies of single-core-bound non-parallel version of a program was it reduces latency, stalls, cache thrashing, i/o saturation, and interprocess communication costs.
Exchange costs only balloon higher across networks, as however fast the cluster partition claims to be... the physics is still going to impose space-time constraints, as modern data-centers will spend >15% of energy cost just moving stuff around networks for lower efficiency code.
I like languages like Julia, as it implicitly abstracts the broadcast operator to handle which areas may be cleanly unrolled. However, much like Erlang/Elixir the multi-host parallelization is not cleanly implemented... yet...
The core problem with HPC software, has always been academics are best modeled like hermit-crabs with facilities. Once a lucky individual inherits a nice new shell, the pincers come out to all smaller entities who may approach with competing interests.
Best of luck, =3
"Crabs Trade Shells in the Strangest Way | BBC Earth"