
pphysch · last Friday at 3:22 PM

I promise you that the main reason HPC is behind on virtualization is not the small amount of overhead it adds. There are a dozen other inefficiencies in the average HPC workload that are more significant.

Most centers don't even have good real-time observability systems to diagnose systemic inefficiencies, leaving application/workload profiling purely up to user-space.

The HP in HPC has really been watered down over the last couple decades, and "IT for computational research" would be a more accurate name. You can do genuinely high-performance computing there, but you'll be an outlier.


Replies

saltcured · last Friday at 10:41 PM

It's a mixture of legacy and reality.

For one, the assumption has been that you had dedicated use of all the nodes and the communication network. Parallel jobs are ultimately limited by the critical-path latency of the cross-node communication, so it would kill your performance if the local node's CPU scheduler kept your actual HPC program from being active when messages arrived from its peer tasks on the other nodes.
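Here's a toy model of that amplification effect (my own sketch, with assumed numbers: 10 ms compute steps, a 1% chance of a 5 ms preemption per rank per step). In a bulk-synchronous job, every rank has to reach each barrier before anyone proceeds, so a rare scheduler preemption on any one node stretches that step for all of them:

```python
import random

def step_time(n_ranks, compute_s=0.010, jitter_prob=0.01, jitter_s=0.005):
    """One bulk-synchronous step: every rank must reach the barrier, so the
    step takes as long as the slowest rank (the critical path)."""
    per_rank = (
        compute_s + (jitter_s if random.random() < jitter_prob else 0.0)
        for _ in range(n_ranks)
    )
    return max(per_rank)

def job_time(n_ranks, n_steps=1000, **kwargs):
    """Total wall-clock time for a tightly coupled job with n_steps barriers."""
    return sum(step_time(n_ranks, **kwargs) for _ in range(n_steps))

random.seed(0)
# A 1% chance of a 5 ms preemption barely affects a single rank, but at
# 1024 ranks some rank is almost always delayed, so every step slows down.
for ranks in (1, 64, 1024):
    print(f"{ranks:5d} ranks: {job_time(ranks):6.2f} s  (ideal 10.00 s)")
```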

It's only at the most "embarrassingly parallel" end of the spectrum that you can tolerate a lot of virtualization and non-determinism, because the tasks communicate so infrequently, or via such asynchronous mechanisms, that being asleep at random times doesn't really hurt the throughput of the whole job.
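For contrast, the same noise model applied to independent tasks (again a hypothetical sketch, same assumed numbers): with no barriers, each task only pays its own average jitter, so the batch as a whole barely notices.

```python
import random

def independent_task_time(n_steps=1000, compute_s=0.010,
                          jitter_prob=0.01, jitter_s=0.005):
    """An embarrassingly parallel task never waits at a barrier, so scheduler
    jitter only costs its own expected overhead (about 0.5% here), not the
    cluster-wide worst case at every step."""
    return sum(
        compute_s + (jitter_s if random.random() < jitter_prob else 0.0)
        for _ in range(n_steps)
    )

random.seed(0)
# The batch finishes when the slowest *independent* task finishes, and even
# that straggler is only slightly slower than the ideal 10.00 s.
times = [independent_task_time() for _ in range(1024)]
print(f"slowest task: {max(times):.2f} s   mean task: {sum(times) / len(times):.2f} s")
```

Same per-node noise, wildly different impact, which is why tolerance for virtualization tracks how tightly coupled the job is.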

But HPC systems were also very "unique". It wasn't just all Linux but a dozen different vendors' Unix variants with very different personalities. And for the bleeding-edge systems, each deployment was practically its own dialect of that vendor's OS. Running a job was like cross-compiling to a one-of-a-kind target. There was no generic platform where you could expect to build an app once and ship it around to whichever supercomputer was available.
