DeepScaling
Ziliang Wang,Shiyi Zhu,Jianguo Li,Wei Jiang,K. K. Ramakrishnan,Yangfei Zheng,Meng Yan,Xiaohong Zhang,Alex X. Liu
DOI: https://doi.org/10.1145/3542929.3563469
2022-01-01
Abstract:Cloud service providers conservatively provision excessive resources to ensure service level objectives (SLOs) are met. They often set lower CPU utilization targets to ensure service quality is not degraded, even when the workload varies significantly. Not only does this potentially waste resources, but it can also consume excessive power in large-scale cloud deployments. This paper aims to minimize resource costs while ensuring SLO requirements are met in a dynamically varying, large-scale production microservice environment. We propose DeepScaling, which introduces three innovative components to adaptively refine the target CPU utilization to a level that is maintained at a stable value to meet SLO constraints while using minimum resources. First, DeepScaling forecasts the workload for each service using a Spatio-temporal Graph Neural Network. Second, DeepScaling estimates the CPU utilization by mapping the workload intensity to an estimated CPU utilization with a Deep Neural Network, while taking into account multiple factors in the cloud environment (e.g., periodic tasks and traffic). Third, DeepScaling generates an autoscaling policy for each service based on an improved Deep Q Network (DQN). The adaptive autoscaling policy updates the target CPU utilization to be a maximum, stable value, while ensuring SLOs is not violated. We compare DeepScaling with state-of-the-art autoscaling approaches in the large-scale production cloud environment of the Ant Group. It shows that DeepScaling outperforms other approaches both in terms of maintaining stable service performance, and saving resources, by a significant margin. The deployment of DeepScaling in Ant Group's real production environment with 135 microservices saves the provisioning of over 30,000 CPU cores per day, on average.