Hacker News

washedDeveloper, last Tuesday at 7:23 PM

The org I work for develops HTCondor. We have a lot of scientists who end up running AlphaFold and other bio-related models on our pool of GPUs and CPUs. I am curious to know how and why your team implemented yet another job scheduler. HTCondor is agnostic to the software being run, so maybe there is more clever scheduling you can come up with. That being said, HTCondor also has pretty high flexibility with regard to policy.


Replies

denizkavi, last Tuesday at 9:38 PM

That’s interesting. We’ve developed a Kubernetes-based scheduler that we’ve found better accounts for our custom job priority needs, allows for stricter data isolation between tenants, and gives us a production-grade control plane, though the core scheduling could certainly be implemented in something like HTCondor.

My first instinct was to use Slurm or AWS Batch, but we started having problems once we tried to go multi-cloud. We're also optimizing for being able to onboard an arbitrary codebase as fast as possible, so building a custom structure natively compatible with our containers (which are now automatically built from Linux machines with the relevant models deployed) has been helpful.