Auto-scaling HTCondor pools using Kubernetes compute resources

Igor Sfiligoi, Thomas DeFanti, Frank Würthwein
DOI: https://doi.org/10.48550/arXiv.2205.01004
2022-05-02
Distributed, Parallel, and Cluster Computing
Abstract: HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.