Hotspot-Aware Scheduling of Virtual Machines with Overcommitment for Ultimate Utilization in Cloud Datacenters
Jiaxi Wu, Pavel Popov, Wenquan Yang, Andrei Gudkov, Elizaveta Ponomareva, Xinming Han, Yunzhe Qiu, Jie Song, Stepan Romanov
DOI: https://doi.org/10.1109/tase.2024.3454821
IF: 6.636
2024-01-01
IEEE Transactions on Automation Science and Engineering
Abstract: We address the under-utilization of datacenter resources in cloud operations, focusing on the challenge of online virtual machine (VM) scheduling. Rather than scheduling VMs based solely on their static flavors, as is traditional, we take their dynamic CPU utilization into account. We employ Γ-robustness theory to handle this dynamic behavior and introduce a novel variant of bin packing that theoretically protects the Physical Machines (PMs) from hotspot formation within a specified probability α. We develop a scheduling algorithm named CloseRadiusFit and cold-start AI-based prediction algorithms for the online version of the problem. To assess how close our approach comes to the optimal solutions, we solve the offline problem by designing a novel Mixed Integer Linear Programming (MILP) model and a combination of numerical upper and lower bounds. Our experimental results demonstrate that CloseRadiusFit achieves narrow gaps of 1.6% and 3.1% relative to the lower and upper bounds, respectively.
Note to Practitioners: A growing trend in the cloud industry is to overcommit VMs on PMs. While this approach can ease the low utilization of datacenter resources, it also raises the risk of hotspots caused by resource contention among VMs. In this work, we propose a novel method that leverages Γ-robustness theory and introduces effective heuristics to achieve ultimate utilization of datacenter resources while ensuring desirable service quality. We validate our approach using real-world production data from Huawei Cloud, improving resource utilization by 125% over traditional flavor-based allocation methods while keeping the occurrence of hotspots below 5% (α = 0.05). Our solution requires only the VMs' real utilization data, which is typically already collected in cloud providers' production environments.
Therefore, with minimal modifications to the existing scheduling system, cloud providers can easily implement our solution and reap its benefits. Moreover, when historical utilization data for VMs is absent (cold start), we use machine learning to predict the VM utilization statistics that our approach requires.
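To make the idea of Γ-robust overcommitment concrete, the sketch below shows a generic budgeted-uncertainty capacity check in the style of Bertsimas and Sim, combined with plain first-fit placement. It is not the paper's CloseRadiusFit algorithm or its bin packing formulation. The names `VM`, `PM`, `robust_load`, and `first_fit`, the per-VM `nominal` and `deviation` values, and the integer budget `gamma` are all assumptions made for illustration, and the sketch does not show how Γ would be chosen to meet a target hotspot probability α.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VM:
    nominal: float    # assumed typical CPU demand, e.g. mean utilization in cores
    deviation: float  # assumed extra demand above nominal in the worst case


@dataclass
class PM:
    capacity: float   # CPU capacity of the physical machine, in cores
    vms: List[VM] = field(default_factory=list)


def robust_load(vms: List[VM], gamma: int) -> float:
    """Nominal load plus the `gamma` largest deviations (budgeted uncertainty)."""
    largest_first = sorted((vm.deviation for vm in vms), reverse=True)
    return sum(vm.nominal for vm in vms) + sum(largest_first[:gamma])


def first_fit(vm: VM, pms: List[PM], gamma: int) -> Optional[int]:
    """Place `vm` on the first PM whose robust load stays within capacity.

    Returns the index of the chosen PM, or None if no PM can host the VM.
    """
    for index, pm in enumerate(pms):
        if robust_load(pm.vms + [vm], gamma) <= pm.capacity:
            pm.vms.append(vm)
            return index
    return None


if __name__ == "__main__":
    hosts = [PM(capacity=8.0), PM(capacity=8.0)]
    for _ in range(4):
        print(first_fit(VM(nominal=2.0, deviation=2.0), hosts, gamma=1))
```

With `gamma=1`, the check assumes that at most one VM per PM peaks at a time, so three of these VMs share an 8-core host, whereas reserving each VM's peak demand of 4 cores would allow only two. A larger `gamma` makes the check more conservative and reduces the degree of overcommitment.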