Towards Standard Kubernetes Scheduling Interfaces for Converged Computing

Claudia Misale, Daniel J. Milroy, Carlos Eduardo Arango Gutierrez, Maurizio Drocco, Stephen Herbein, Dong H. Ahn, Zvonko Kaiser, Yoonho Park
DOI: https://doi.org/10.1007/978-3-030-96498-6_18
2022-01-01
Abstract: High performance computing (HPC) and cloud technologies are increasingly coupled to accelerate the convergence of traditional HPC with new simulation, data analysis, machine-learning, and artificial intelligence approaches. While the HPC+cloud paradigm, or converged computing, is ushering in new scientific discoveries with unprecedented levels of workflow automation, several key mismatches between HPC and cloud technologies still prevent this paradigm from realizing its full potential. In this paper, we present a joint effort between IBM Research, Lawrence Livermore National Laboratory (LLNL), and Red Hat to address these mismatches and to bring full HPC scheduling awareness into Kubernetes, the de facto container orchestrator for cloud-native applications, which is increasingly being adopted as a key converged-computing enabler. We found that Kubernetes lacks the interfaces needed to enable the full spectrum of converged-computing use cases in three areas: (A) an interface to enable HPC batch-job scheduling (e.g., locality-aware node selection), (B) an interface to enable HPC workloads or task-level scheduling, and (C) a resource co-management interface to allow HPC resource managers and Kubernetes to co-manage a resource set. We detail our methodology and present our results, whereby the advanced graph-based scheduler Fluxion, part of the open-source Flux scheduling framework, is integrated as a Kubernetes scheduler plug-in, KubeFlux. Our initial performance study shows that KubeFlux exhibits similar performance (up to measurement precision) to the default scheduler, despite KubeFlux’s considerably more sophisticated scheduling capabilities.