Sure, the conventional approach is OpenMP within a node and MPI across nodes, but:
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system.
With 2 IO dies, aren't there effectively 2 meta-NUMA nodes with 4 leaf nodes each? Or am I off base there?
The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.
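To make the hierarchy I'm picturing concrete, here is a rough sketch of the two-level NUMA view (2 IO dies, 4 leaf nodes each). The names `place_of` and `hops`, and the 8 cores per leaf, are my own assumptions for illustration; this doesn't query any real hardware or reflect any vendor's actual topology.

```python
from typing import NamedTuple

# Hypothetical layout, assumed purely for illustration.
IO_DIES = 2            # the "meta" NUMA nodes from the question above
LEAVES_PER_IO_DIE = 4  # leaf NUMA nodes hanging off each IO die
CORES_PER_LEAF = 8     # assumption; not a figure from this thread


class Place(NamedTuple):
    io_die: int
    leaf: int
    core: int


def place_of(thread_id: int) -> Place:
    """Map a flat thread index onto the assumed (io_die, leaf, core) hierarchy."""
    total = IO_DIES * LEAVES_PER_IO_DIE * CORES_PER_LEAF
    if not 0 <= thread_id < total:
        raise ValueError(f"thread_id must be in [0, {total})")
    leaf_global, core = divmod(thread_id, CORES_PER_LEAF)
    io_die, leaf = divmod(leaf_global, LEAVES_PER_IO_DIE)
    return Place(io_die, leaf, core)


def hops(a: int, b: int) -> int:
    """Coarse distance: 0 = same leaf, 1 = same IO die, 2 = different IO dies."""
    pa, pb = place_of(a), place_of(b)
    if pa.io_die != pb.io_die:
        return 2
    if pa.leaf != pb.leaf:
        return 1
    return 0
```

A runtime could, in principle, use something like `hops` to pick a strategy per pair of threads: plain shared memory at distance 0, and maybe explicit messages at distance 2. Whether that actually wins is exactly the open question.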
I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?