Zeus: Improving Resource Efficiency Via Workload Colocation for Massive Kubernetes Clusters

Xiaolong Zhang, Lanqing Li, Yuan Wang, E. Chen, Lidan Shou
DOI: https://doi.org/10.1109/access.2021.3100082
IF: 3.9
2021-01-01
IEEE Access
Abstract: With the popularity of container-based microservices and cloud-native architectures, Kubernetes has established itself as the de facto standard for container orchestration. Kubernetes is known for making applications easy to deploy and operate; however, it suffers from low resource utilization, incurring high server provisioning and operational costs, for the following reasons. First, it is common practice for latency-sensitive services to be over-provisioned for the peak load: Kubernetes might consider such peak-load provisioning as constant, even though the actual resource utilization is low. Second, the isolation between different containers is poor and cannot prevent performance degradation when best-effort jobs and latency-sensitive services run together. Users may have to run these two classes of workloads on separate machines to avoid interference, resulting in higher provisioning costs. This paper presents a highly scalable cluster scheduling system named Zeus, which is built on Kubernetes extension mechanisms. Zeus achieves safe colocation of best-effort jobs and latency-sensitive services. Furthermore, Zeus schedules best-effort jobs based on the real server utilization and can adaptively allocate resources between the two classes of workloads. In addition, Zeus enhances container isolation by coordinating software and hardware isolation features. As a result, Zeus can effectively improve the resource utilization of Kubernetes clusters. We discuss the design and implementation of Zeus and evaluate its effectiveness using latency-sensitive services and best-effort jobs in a massive production environment. The results show that by colocating latency-sensitive services with best-effort jobs, Zeus can increase the average CPU utilization from 15% to 60% without SLO violations.
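The contrast the abstract draws between peak-load provisioning and real server utilization can be illustrated with a small sketch. This is not Zeus's actual algorithm, which the abstract does not describe. It is a minimal illustration under stated assumptions: the `Node` type, the function names, and the 20% safety `headroom` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of one server, in CPU cores."""
    name: str
    cpu_capacity: float   # total cores on the machine
    cpu_requested: float  # cores reserved by latency-sensitive services (peak-load provisioning)
    cpu_used: float       # cores actually in use right now (measured)


def request_based_free(node: Node) -> float:
    """Free CPU if reservations are treated as constant consumption."""
    return max(node.cpu_capacity - node.cpu_requested, 0.0)


def usage_based_free(node: Node, headroom: float = 0.2) -> float:
    """Free CPU based on measured usage, keeping a safety margin.

    `headroom` is an assumed fraction of capacity left untouched so that
    latency-sensitive services can still absorb load spikes.
    """
    return max(node.cpu_capacity * (1.0 - headroom) - node.cpu_used, 0.0)


def pick_node(nodes: List[Node], job_cpu: float, headroom: float = 0.2) -> Optional[Node]:
    """Place a best-effort job on the node with the most usage-based free CPU."""
    candidates = [n for n in nodes if usage_based_free(n, headroom) >= job_cpu]
    if not candidates:
        return None
    return max(candidates, key=lambda n: usage_based_free(n, headroom))


nodes = [
    Node("node-a", cpu_capacity=32, cpu_requested=30, cpu_used=5),
    Node("node-b", cpu_capacity=32, cpu_requested=28, cpu_used=20),
]
chosen = pick_node(nodes, job_cpu=8)
```

In this example a request-based view leaves only 2 free cores on `node-a`, so an 8-core best-effort job would not fit anywhere, while the usage-based view finds room on `node-a` because its measured usage is far below its reservation. A real system would also need the isolation and adaptive reallocation the abstract mentions, which this sketch omits.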