Tetris: Proactive Container Scheduling for Long-Term Load Balancing in Shared Clusters

Fei Xu, Xiyue Shen, Shuohao Lin, Li Chen, Zhi Zhou, Fen Xiao, Fangming Liu
DOI: https://doi.org/10.1109/tsc.2024.3442544
IF: 11.019
2024-10-11
IEEE Transactions on Services Computing
Abstract: Long-running containerized workloads (e.g., machine learning), which typically show time-varying patterns, are increasingly prevalent in shared production clusters. To improve workload performance, current schedulers mainly focus on optimizing the short-term benefits of cluster load balancing or initial container placement on servers. However, this inevitably causes many invalid migrations (i.e., containers are migrated back and forth among servers over a short time window), leading to significant service level objective (SLO) violations. This paper introduces Tetris, a model predictive control (MPC)-based container scheduling strategy that proactively migrates long-running workloads for cluster load balancing. Specifically, we first build a discrete-time dynamic model for long-term optimization of container scheduling. To solve this optimization problem, Tetris employs two main components: (1) a container resource predictor, which leverages time-series analysis approaches to accurately predict container resource consumption; and (2) an MPC-based container scheduler that jointly optimizes cluster load balancing and container migration cost over a sliding time window. We implement and open-source a prototype of Tetris based on K8s. Extensive prototype experiments and trace-driven simulations demonstrate that Tetris can improve the cluster load balancing degree by up to 77.8% without incurring any SLO violations, compared to state-of-the-art container scheduling strategies.
computer science, information systems, software engineering
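
The abstract describes a two-part loop: predict each container's resource consumption, then choose migrations that trade off load balance against migration cost over a sliding window. The following is a minimal Python sketch of that general receding-horizon idea, not the authors' algorithm. Everything in it is an assumption made for illustration: the function names (`forecast`, `imbalance`, `mpc_step`), the moving-average predictor standing in for the paper's time-series analysis, variance of per-server load as the balance metric, a fixed per-migration cost, and a greedy search that considers at most one migration per step.

```python
from statistics import pvariance


def forecast(history, horizon):
    """Stand-in predictor (assumption): repeat the mean of the last 3 samples."""
    recent = history[-3:]
    level = sum(recent) / len(recent)
    return [level] * horizon


def imbalance(placement, predicted, n_servers, horizon):
    """Sum, over the horizon, of the variance of predicted per-server load."""
    total = 0.0
    for t in range(horizon):
        loads = [0.0] * n_servers
        for container, server in placement.items():
            loads[server] += predicted[container][t]
        total += pvariance(loads)
    return total


def mpc_step(placement, histories, n_servers, horizon=4, migration_cost=0.5):
    """Return the placement after at most one migration.

    A migration is applied only if the predicted imbalance over the whole
    horizon, plus the migration cost, is lower than staying put.
    """
    predicted = {c: forecast(h, horizon) for c, h in histories.items()}
    best_cost = imbalance(placement, predicted, n_servers, horizon)
    best = dict(placement)
    for container, current in placement.items():
        for target in range(n_servers):
            if target == current:
                continue
            candidate = dict(placement)
            candidate[container] = target
            cost = imbalance(candidate, predicted, n_servers, horizon) + migration_cost
            if cost < best_cost:
                best_cost, best = cost, candidate
    return best


# Hypothetical example: containers "a" and "b" share server 0, "c" is on server 1.
placement = {"a": 0, "b": 0, "c": 1}
histories = {"a": [6, 6, 6], "b": [2, 2, 2], "c": [2, 2, 2]}
print(mpc_step(placement, histories, n_servers=2))
```

In a receding-horizon loop, `mpc_step` would be called again at each time step with updated histories, so only the first decision of each plan is ever applied. Because every move is charged `migration_cost` against its benefit over the whole horizon rather than a single step, a move that would soon have to be undone is less likely to be chosen, which is the intuition behind avoiding the invalid migrations mentioned in the abstract.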