Detecting Data Center Cooling Problems Using a Data-driven Approach

Charley Chen, Guosai Wang, Jiao Sun, Wei Xu
DOI: https://doi.org/10.1145/3265723.3265730
2018-01-01
Abstract: Cooling problems are common in data centers, and many of them are hard to detect, especially the hidden ones. These problems affect overall system dependability, performance, and power efficiency. We propose a novel method to detect cooling problems. Using common monitoring data available in most data centers, such as environmental temperature and hardware status, we build a workload-independent cooling profile for each server. With the cooling profiles, we are able to detect two types of cooling failures: transient and lasting. We detect transient failures by comparing the observed temperature with the model prediction, while we detect lasting failures by comparing the cooling profiles among different servers. We demonstrate the general applicability of our detection methods in three production data centers with vastly different scales, server types, and workloads, and detect several real cooling problems that had been hidden for months.
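The two detection steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the cooling profile, so everything here is an assumption made for illustration: the profile is taken to be a least-squares line relating a server's temperature rise above inlet temperature to its load, and the names `fit_profile`, `transient_anomalies`, `lasting_outliers`, `tolerance`, and `k` are hypothetical. It is not the authors' model.

```python
from statistics import mean, median

def fit_profile(loads, temp_rises):
    """Fit temp_rise ~ slope * load + intercept by ordinary least squares.

    ASSUMPTION: a straight line stands in for the paper's cooling profile.
    """
    mx, my = mean(loads), mean(temp_rises)
    sxx = sum((x - mx) ** 2 for x in loads)
    sxy = sum((x - mx) * (y - my) for x, y in zip(loads, temp_rises))
    slope = sxy / sxx
    return slope, my - slope * mx

def transient_anomalies(profile, loads, temp_rises, tolerance=5.0):
    """Indices where the observed rise exceeds the prediction by > tolerance."""
    slope, intercept = profile
    return [i for i, (x, y) in enumerate(zip(loads, temp_rises))
            if y - (slope * x + intercept) > tolerance]

def lasting_outliers(profiles, k=3.0):
    """Servers whose slope deviates from the fleet median by > k MADs.

    ASSUMPTION: comparing slopes via median absolute deviation (MAD) is one
    simple way to compare profiles among servers; the paper may do otherwise.
    """
    slopes = {name: p[0] for name, p in profiles.items()}
    med = median(slopes.values())
    mad = median(abs(s - med) for s in slopes.values()) or 1e-9
    return [name for name, s in slopes.items() if abs(s - med) / mad > k]
```

In this sketch, a transient failure shows up as a reading far above what the server's own profile predicts, while a lasting failure shows up as a profile that differs from those of comparable servers, since a permanently degraded server would look normal when compared only against itself.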