musicale • today at 12:36 AM

HPC systems often still use batch schedulers where, even for a fast job, you may well not get your results until the next day (or whenever your job actually runs and completes).

It is annoying to find out that your job failed to run or exited immediately due to a typo or other minor mistake.

Of course ML training (and scientific computing) jobs can take weeks or months to complete. Checkpoint and restart features are important because node or other failures are almost inevitable.
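For anyone who hasn't had to write this, here is a minimal sketch of the checkpoint/restart pattern in Python. Everything in it is an assumption made for illustration: the names `save_checkpoint`, `load_checkpoint`, and `run`, the JSON state, and the `fail_at` parameter that simulates a node failure. Real training and simulation codes usually rely on their framework's own checkpoint facilities, so treat this as the shape of the idea rather than how any particular system does it.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before the rename
        os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, num_steps, checkpoint_every=10, fail_at=None):
    """Toy job: sum the step indices, checkpointing every few steps.

    fail_at is a hypothetical knob that raises at a given step to
    simulate a node failure partway through the job.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint if there is one
    while state["step"] < num_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Writing to a temporary file and renaming it means a crash in the middle of a save leaves the previous checkpoint intact. The resubmitted job only repeats the work done since the last checkpoint, not the whole run.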
