Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters
Yongqian Sun, Daguo Cheng, Tiankai Yang, Yuhe Ji, Shenglin Zhang, Man Zhu, Xiao Xiong, Qiliang Fan, Minghan Liang, Dan Pei, Tianchi Ma, Yu Chen
DOI: https://doi.org/10.1109/TC.2023.3272288
IF: 3.183
2023-01-01
IEEE Transactions on Computers
Abstract: To ensure the performance of large-scale datacenters, operators need to monitor up to tens of millions of KPIs of various types, e.g., CPU utilization and memory utilization. For each KPI, it is crucial but challenging to detect outliers that deviate from its historical patterns or from the patterns of other KPIs in the same period. In this work, we propose OutSpot, an unsupervised outlier detection framework that integrates hierarchical agglomerative clustering (HAC) with a conditional variational autoencoder (CVAE), which significantly improves computational efficiency and comprehensively learns the above two patterns. Additionally, two simple yet effective techniques, soft threshold and median filter, are applied to precisely determine outlier KPIs. We evaluate OutSpot on two real-world datasets collected from datacenters owned by a top-tier global short-video service provider and a top-tier domestic operator, respectively. The results demonstrate that OutSpot achieves the best F1 scores of 0.95 and 0.91 and AUCs of 0.99 and 0.99 on the two datasets, significantly outperforming seven baseline outlier detection methods.
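
The abstract names a median filter as one of the post-processing steps but does not define it. As a rough illustration only, the sketch below shows how a generic sliding-window median filter can suppress isolated spikes in a sequence of per-timestep outlier scores before a cutoff is applied. The function names, the window size, the example scores, and the plain fixed cutoff are all assumptions made for this sketch; in particular, the fixed cutoff stands in for the paper's soft threshold, which is not specified here, and none of this should be read as OutSpot's actual implementation.

```python
from statistics import median


def median_filter(scores, window=3):
    """Generic sliding-window median; the window is truncated at the edges.

    `window` is a hypothetical parameter and must be a positive odd integer.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(scores[max(0, i - half): i + half + 1])
        for i in range(len(scores))
    ]


def flag_outliers(scores, threshold, window=3):
    """Smooth the scores, then compare against a fixed cutoff.

    The fixed cutoff is an assumption for illustration; it is NOT the
    soft threshold mentioned in the abstract.
    """
    smoothed = median_filter(scores, window)
    return [s > threshold for s in smoothed]


# Hypothetical outlier scores: one isolated spike (index 2) and one
# sustained run of high scores (indices 5 to 7).
scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.1, 0.1]
flags = flag_outliers(scores, threshold=0.5)
print(flags)
```

With these made-up scores, the isolated spike is smoothed away while the sustained run remains flagged, which is the usual motivation for applying a median filter to noisy detection scores.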