
sliken, last Friday at 4:21 PM

As you dig deeper I think you'll find a method behind the madness.

Sure, modules just play with env variables. But they're easy to inspect (module show), easy to document ("use module load ..."), they let admins change the default when things improve or a bug is fixed, and they still let users pin a version. It's very transparent, very discoverable, and very "stale". Research needs dictate that you can reproduce research from years past. It's much easier to look at your output file and see the exact version of compiler, MPI stack, libraries, and application than to dig through a container build file or similar. Not to mention it's far more efficient to keep a few lines of output than to keep the container around.
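That transparency is visible in the modulefile itself. Here's a minimal Tcl modulefile sketch (the package, version, and install paths are hypothetical, not from the comment above); it does nothing beyond adjusting environment variables, and "module show" prints these exact operations back to the user:

```tcl
#%Module1.0
## Hypothetical modulefile for a GCC toolchain; paths are illustrative.
## "module show gcc/12.2.0" would display these operations verbatim.
prepend-path PATH            /opt/apps/gcc/12.2.0/bin
prepend-path LD_LIBRARY_PATH /opt/apps/gcc/12.2.0/lib64
prepend-path MANPATH         /opt/apps/gcc/12.2.0/share/man
setenv       CC              gcc
conflict     gcc
```

Since the whole effect is a handful of env-var edits, "module unload" can undo it cleanly, which is a big part of why the scheme has stayed so simple for decades.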

As for Slurm, I find it quite useful. Your main complaint is no default systemd service files? It's not like it's hard to set up systemd units and their dependencies. Slurm's job is scheduling, which means matching job requests to resources, deciding whom to run, and where to run it. It does that well and runs jobs efficiently: cgroup v2, pinning tasks to the CPUs they need, placing jobs on the CPUs closest to the GPUs they use, etc. Combined with PMIx it allows impressive launch speeds across large clusters. If your biggest complaint is the systemd service files, that's actually high praise. You did mention logging; I find it pretty good. You can increase the verbosity, focus on the server (slurmctld) or client side (slurmd), and turn on just what you're interested in, say the backfill scheduler. I've gotten pretty deep into the weeds, and basically everything Slurm does can be logged if you ask for it.
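The verbosity and per-subsystem logging mentioned above are a few lines of slurm.conf. A sketch (log paths are illustrative, not from any particular site):

```
# slurm.conf logging sketch
SlurmctldDebug=debug2                          # server-side (scheduler) verbosity
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug                              # client-side (compute node) verbosity
SlurmdLogFile=/var/log/slurm/slurmd.log
DebugFlags=Backfill                            # log only the backfill scheduler's decisions
```

The same knobs can be flipped at runtime without a restart, e.g. "scontrol setdebug debug2" and "scontrol setdebugflags +backfill".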

Sounds like you've used some poorly run clusters. I don't doubt it, but I wouldn't assume that's HPC in general. I've built HPC clusters and did not use the university's AD, specifically because it wasn't reliable enough. IMO a cluster should continue to schedule and run jobs even if the uplink is down. Running a past-EoL OS on an HPC cluster is definitely a sign that it's not run well, and it seems common when a heroic student ends up managing a cluster, then graduates and leaves it unmanaged. Sadly it's also common for general IT to run an HPC cluster poorly; it's really a different set of constraints, hence the need for a dedicated HPC group.

Plenty of HPC clusters out there are happy to support the tools that help their users get the most research done.