Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers

Haoyu Liao, Tong-yu Liu, Jianmei Guo, Bo Huang, Dingyu Yang, Jonathan Ding
DOI: https://doi.org/10.1109/tpds.2024.3494879
IF: 5.3
2024-11-29
IEEE Transactions on Parallel and Distributed Systems
Abstract: The article focuses on an understudied yet fundamental problem: existing methods typically average the utilization of multiple hardware threads to evaluate the available CPU resources. However, on Simultaneous Multi-Threading (SMT) processors this approach can underestimate the actual usage of the underlying physical core, leading to an overestimation of the remaining resources. The overestimation propagates from the microarchitecture to operating systems and cloud schedulers, where it may misguide scheduling decisions, exacerbate CPU overcommitment, and increase Service Level Agreement (SLA) violations. To address the potential overestimation problem, we propose an SMT-aware and purely data-driven approach named Remaining CPU (RCPU) that reserves more CPU resources to restrict CPU overcommitment and prevent SLA violations. RCPU requires only a few modifications to existing cloud infrastructures and can be scaled up to large data centers. Extensive evaluations in the data center show that RCPU reduces SLA violations by 18% on average for 98% of all latency-sensitive applications. In a benchmarking experiment, we show that RCPU improves accuracy by 69% in terms of Mean Absolute Error (MAE) compared to the state-of-the-art.
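To make the overestimation concrete, the following is a minimal Python sketch of the problem the abstract describes, not of RCPU itself, whose model is not given here. It compares the thread-averaged utilization of one physical core with a toy estimate of how busy that core actually is. The toy estimate assumes that sibling hardware threads are busy independently of each other, so the core is occupied whenever at least one thread is busy; this independence assumption, and all function names, are illustrative choices rather than anything taken from the paper. The MAE helper uses the standard definition of Mean Absolute Error.

```python
from math import prod


def thread_average_utilization(thread_utils):
    """Conventional estimate: mean utilization over sibling hardware threads."""
    return sum(thread_utils) / len(thread_utils)


def core_busy_fraction(thread_utils):
    """Toy estimate of physical-core usage (illustrative assumption only).

    Assumes sibling threads are busy independently, so the core is in use
    whenever at least one of them is busy: 1 - prod(1 - u_i).
    """
    return 1.0 - prod(1.0 - u for u in thread_utils)


def mean_absolute_error(predicted, actual):
    """Standard MAE: mean of |predicted - actual| over paired samples."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical utilization samples for cores with two sibling threads each.
cores = [[1.0, 0.0], [0.5, 0.5]]

averaged = [thread_average_utilization(c) for c in cores]
core_busy = [core_busy_fraction(c) for c in cores]

for utils, avg, busy in zip(cores, averaged, core_busy):
    # "Remaining" is what a scheduler would believe is still available.
    print(f"threads={utils} remaining(avg)={1 - avg:.2f} remaining(core)={1 - busy:.2f}")

print(f"MAE of averaged estimate vs. toy core usage: "
      f"{mean_absolute_error(averaged, core_busy):.3f}")
```

In the first sample one thread is fully busy and its sibling is idle: averaging reports half of the core as still available, while under the toy model the physical core has nothing left to give. A scheduler that trusts the averaged figure would place more work on that core than it can absorb, which is the overcommitment path the abstract points to.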
Computer Science, Theory & Methods; Engineering, Electrical & Electronic