Hacker News

jandrewrogers | last Friday at 7:49 AM

I can easily explain this, having worked in this space. The new languages don’t actually solve any urgent problems.

How people imagine scalable parallelism works and how it actually works doesn’t have a lot of overlap. The code is often boringly single-threaded because that is optimal for performance.

The single biggest resource limit in most HPC code is memory bandwidth. If you are not addressing this then you are not addressing a real problem for most applications. For better or worse, C++ is really good at optimizing for memory bandwidth. Most of the suggested alternative languages are not.

It is that simple. The new languages address irrelevant problems. It is really difficult to design a language that is more friendly to memory bandwidth than C++. And that is the resource you desperately need to optimize for in most cases.
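The data-layout control behind this claim can be sketched in a few lines. The example below is illustrative (names are invented, not from the thread): summing one field of an array-of-structs drags the unused fields through the cache, while the struct-of-arrays layout streams only the bytes actually needed, which is the kind of explicit memory-bandwidth control C++ gives you for free.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Array-of-structs: summing only `x` still pulls `y` and `z`
// through the cache, wasting two thirds of every cache line.
struct ParticleAoS { double x, y, z; };

double sum_x_aos(const std::vector<ParticleAoS>& ps) {
    double s = 0.0;
    for (const auto& p : ps) s += p.x;  // 8 useful bytes per 24 loaded
    return s;
}

// Struct-of-arrays: the same reduction streams a dense array,
// so every byte fetched from DRAM is actually used.
struct ParticlesSoA {
    std::vector<double> x, y, z;
};

double sum_x_soa(const ParticlesSoA& ps) {
    return std::accumulate(ps.x.begin(), ps.x.end(), 0.0);  // contiguous, prefetcher-friendly
}
```

Both functions compute the same value; only the memory traffic differs, and C++ lets you pick the layout without any abstraction penalty.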


Replies

Joel_Mckay | last Friday at 8:45 AM

> C++ is really good at optimizing for memory bandwidth

In general, most modern CPU thread-safe code is still a bodge in most languages. If folks are unfortunate enough to encounter inseparable overlapping-state sub-problems, then there is no magic pixie dust to escape the computational cost. On average, attempting to parallelize this type of code can end up >30% slower on identical hardware, and a GPU memory-copy exchange can make it even worse.

For those types of problems, a pinned-core, higher-clock-speed chip will sometimes win out even over a large multi-core CPU.

Thus, it is no mystery why most people revert to batching k copies of the single-core-bound, non-parallel version of a program: it reduces latency, stalls, cache thrashing, I/O saturation, and interprocess-communication costs.
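That batching pattern can be sketched as follows (a hypothetical example, not from the thread): each shard is owned by exactly one worker running the unmodified single-threaded kernel, so there are no locks and no shared mutable inputs, only a tiny merge at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// The original single-threaded kernel, completely unchanged.
long process_shard(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Batch k copies: each thread owns a disjoint shard, so there is no
// synchronization and no inter-thread traffic until the final merge.
long process_batched(const std::vector<std::vector<int>>& shards) {
    std::vector<long> results(shards.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < shards.size(); ++i)
        workers.emplace_back([&, i] { results[i] = process_shard(shards[i]); });
    for (auto& t : workers) t.join();
    return std::accumulate(results.begin(), results.end(), 0L);
}
```

The point is exactly the one made above: the per-shard code stays boringly single-threaded, and the parallelism lives entirely outside it.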

Exchange costs only balloon further across networks: however fast the cluster partition claims to be, physics still imposes space-time constraints, and modern data centers spend >15% of their energy cost just moving data around the network on behalf of lower-efficiency code.

I like languages like Julia, which implicitly abstracts the broadcast operator to decide which regions may be cleanly unrolled. However, much like Erlang/Elixir, the multi-host parallelization is not cleanly implemented... yet...

The core problem with HPC software has always been that academics with facilities are best modeled as hermit crabs: once a lucky individual inherits a nice new shell, the pincers come out against any smaller entity that approaches with competing interests.

Best of luck, =3

"Crabs Trade Shells in the Strangest Way | BBC Earth"

https://www.youtube.com/watch?v=f1dnocPQXDQ

bruce343434 | last Friday at 8:24 AM

What does it mean to be friendly to memory bandwidth, and why does C++ excel at it, over, say, Fortran or C or Rust?

iamcreasy | last Friday at 7:19 PM

The Julia language is also used for HPC; its webpage cites performance parity with C++. Would it be correct to infer that Julia provides the same level of control over memory bandwidth?

j4k0bfr | last Friday at 8:32 AM

I'm pretty interested in realtime computing and didn't realise C++ was considered bandwidth efficient! Coming from C, I find myself avoiding most 'new' C++ features because I can't easily figure out how they allocate without grabbing a memory profiler.

convolvatron | last Friday at 4:30 PM

I worked in parallel computing in the late 80s and early 90s, when parallel languages were really a thing. In HPC applications memory bandwidth is certainly a concern, although usually the global communication bandwidth (assuming they are different) is the roofline. By saying c++ you're implying that MPI is really sufficient. It's certainly possible to prop up parallel codes with MPI, but it's quite tiresome, and it makes it hard to play with the really interesting problem, which is the mapping of the domain state across the entire machine.

Other hugely important problems that c++ doesn't address are latency hiding, which avoids stalling your entire core while waiting for a distributed message, and the related technique of interleaving computation with communication.
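The interleaving described above boils down to: post the transfer, compute on data that doesn't depend on it, then wait. In MPI terms that is MPI_Irecv followed later by MPI_Wait. A self-contained illustration (the "network" here is simulated with a sleeping task, and std::async stands in for a nonblocking receive):

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for a nonblocking receive: the "network" sleeps,
// then delivers a halo buffer. Purely illustrative.
std::vector<double> fetch_halo() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return {1.0, 2.0, 3.0};
}

double step() {
    // 1. Post the communication (cf. MPI_Irecv).
    auto halo = std::async(std::launch::async, fetch_halo);

    // 2. Compute on interior cells, which need no remote data,
    //    while the transfer is in flight -- this hides the latency.
    std::vector<double> interior(1000, 0.5);
    double local = std::accumulate(interior.begin(), interior.end(), 0.0);

    // 3. Only now block on the transfer (cf. MPI_Wait) and finish
    //    the boundary contribution.
    auto h = halo.get();
    return local + std::accumulate(h.begin(), h.end(), 0.0);
}
```

The commenter's point is that nothing in the language helps you structure this; the overlap is entirely hand-rolled, and the compiler has no view into it.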

Another related problem is that a lot of the very interesting hardware that might exist to do things like RDMA, in-network collective operations, or even memory-controller-based rich atomics isn't part of the compiler's view, and thus usually ends up as library implementations or really hacky inlines.

Is there a good turnkey parallel language? No. Is there sufficient commonality in architecture, or much investment in the interesting ideas that were abandoned because of cost? No. But there remains a huge potential to exploit parallel hardware with implicit abstractions, and I think saying 'just use c++' is missing almost all of the picture here.

addendum: even if you are working on a single-die multicore machine, if you don't account for locality, it doesn't matter how good your code generator is — you will saturate the memory network. So locality is important, and languages like Chapel are explicitly trying to provide useful abstractions for you to manage it.

suuuuuuuu | last Friday at 10:37 AM

If you think C++ is the best here, then I don't think you've actually worked in this space or appreciated the actual problems these languages try to solve — in particular because you can't program accelerators with C++.

Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (that must be handled manually in performant GPU code, for instance) with optimized memory tiling and other strategies. Kernel fusion is another nontrivial example that improves effective bandwidth.
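Kernel fusion, mentioned above, can be illustrated even in plain C++ (a hypothetical example): running two elementwise kernels separately forces the intermediate array through memory twice, while the fused version makes a single pass — the transformation that accelerator-oriented compilers try to perform automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: y = a*x, then z = y + b. The intermediate `y` makes a
// round trip through memory, roughly doubling bandwidth consumption.
std::vector<double> scale_then_add(const std::vector<double>& x, double a, double b) {
    std::vector<double> y(x.size()), z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = a * x[i];
    for (std::size_t i = 0; i < x.size(); ++i) z[i] = y[i] + b;
    return z;
}

// Fused: one pass, no intermediate array. Same result, roughly half
// the memory traffic.
std::vector<double> fused(const std::vector<double>& x, double a, double b) {
    std::vector<double> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) z[i] = a * x[i] + b;
    return z;
}
```

On a GPU the two unfused kernels would also each pay a launch and a global-memory round trip, which is why fusion is a first-class optimization in those toolchains.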

Add to that the diversity of hardware one needs to target (both within and among vendors) — i.e., portability not just of function but of performance — and the need for better tooling becomes abundantly obvious. C++ isn't even an entrant in this space.
