Data Center Temperature Prediction and Management Based on a Two-stage Self-Healing Model

Wang Simin, Kang Yifei, Xu Yixuan, Ma Chunmiao, Wang Haitao, Wu Weiguo
DOI: https://doi.org/10.1016/j.simpat.2023.102883
IF: 4.199
2024-01-01
Simulation Modelling Practice and Theory
Abstract: While providing efficient and convenient cloud services, data centers also place great pressure on energy consumption and the environment. Rising server temperature not only increases cooling cost but also seriously threatens the operational safety of the data center. Effective analysis and prediction of data center temperature therefore helps prevent server overheating and shutdown, and is crucial to task scheduling, resource allocation optimization, and energy efficiency improvement. This article proposes a Two-stage Gated Recurrent Unit (GRU) temperature prediction algorithm with a self-healing mechanism. The algorithm first builds a prediction model for CPU utilization, an important parameter affecting temperature, and then feeds the output of that model into the server temperature prediction model as an input parameter, which fits the changes of each parameter more accurately. To avoid the loss of prediction accuracy caused by previously unseen operating conditions and by changes in physical environmental factors while the model is running, a self-healing mechanism is proposed to maintain the model's prediction accuracy. Experiments show that the prediction model can accurately predict the evolution of server inlet temperature under dynamic workload. It reduces the prediction error (RMSE) to 0.280, and the average predicted temperature difference is only 0.675, a 10% improvement in accuracy over single-stage prediction. Applying the Two-stage prediction approach to other machine learning methods can also improve their prediction accuracy. Based on the prediction model, this paper proposes a task scheduling algorithm that minimizes temperature difference. The algorithm makes the temperature across servers more balanced after task allocation, effectively reducing the number of servers running at high and low temperatures in the data center, avoiding wasted cooling, and saving energy in the data center.
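As a rough illustration of the pipeline the abstract describes, the following Python sketch chains a CPU-utilization forecast into a temperature forecast, checks whether retraining should be triggered, and places tasks greedily so that the spread of predicted temperatures stays small. It is only a sketch under stated assumptions: the paper uses GRU models, whereas the moving average and the linear utilization-to-temperature map here are simple stand-ins, and every function name, parameter, and threshold (predict_cpu, predict_temp, needs_healing, schedule, base, slope, threshold) is hypothetical rather than taken from the paper.

```python
from statistics import mean


def predict_cpu(history, window=3):
    """Stage 1 stand-in (assumption): forecast the next CPU utilization
    as a moving average of the last `window` samples. The paper uses a GRU."""
    return mean(history[-window:])


def predict_temp(cpu_util, base=20.0, slope=0.1):
    """Stage 2 stand-in (assumption): map forecast utilization (%) to inlet
    temperature with a linear model. The paper uses a second GRU."""
    return base + slope * cpu_util


def needs_healing(predicted, observed, threshold=1.0):
    """Hypothetical self-healing trigger: request retraining when the RMSE
    between recent predictions and observations exceeds `threshold`."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    rmse = (sum(squared) / len(squared)) ** 0.5
    return rmse > threshold


def schedule(tasks, cpu_now):
    """Greedy placement: put each task (a CPU demand in %) on the server
    that yields the smallest max-min spread of predicted temperatures."""
    load = list(cpu_now)
    placement = []
    for demand in tasks:
        best, best_spread = None, None
        for i in range(len(load)):
            trial = load[:i] + [load[i] + demand] + load[i + 1:]
            temps = [predict_temp(u) for u in trial]
            spread = max(temps) - min(temps)
            if best_spread is None or spread < best_spread:
                best, best_spread = i, spread
        load[best] += demand
        placement.append(best)
    return placement, load


if __name__ == "__main__":
    cpu_forecast = predict_cpu([10, 20, 30, 40])
    print("forecast CPU:", cpu_forecast)
    print("forecast temperature:", predict_temp(cpu_forecast))
    print("retrain?", needs_healing([23.0, 24.0], [25.0, 26.0]))
    print("placement:", schedule([20, 10], [50, 30, 40]))
```

The greedy rule is one simple way to read "minimizes temperature difference"; the paper's actual scheduling algorithm and retraining criterion may differ.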