Robust System Instance Clustering for Large-Scale Web Services

Shenglin Zhang, Dongwen Li, Zhenyu Zhong, Jun Zhu, Minghan Liang, Jiexi Luo, Yongqian Sun, Ya Su, Sibo Xia, Zhongyou Hu, Yuzhi Zhang, Dan Pei, Jiyan Sun, Yinlong Liu
DOI: https://doi.org/10.1145/3485447.3511983
2022-01-01
Abstract: System instance clustering is crucial for large-scale Web services because it can significantly reduce the training overhead of anomaly detection methods. However, the vast number of system instances, combined with massive numbers of time points, redundant metrics, and noise, poses significant challenges. We propose OmniCluster to accurately and efficiently cluster system instances for large-scale Web services. It combines a one-dimensional convolutional autoencoder (1D-CAE), which extracts the main features of system instances, with a simple, novel, yet effective three-step feature selection strategy. We evaluated OmniCluster on real-world data collected from a top-tier content service provider serving more than one billion monthly active users (MAU). The results show that OmniCluster achieves high accuracy (NMI = 0.9160) and reduces the training overhead of five anomaly detection models by 95.01% on average.
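The accuracy figure above is reported as NMI (normalized mutual information), which compares a clustering against ground-truth cluster labels: 1 means the two partitions are identical up to relabeling, and 0 means they are statistically independent. The sketch below computes NMI in plain Python. It assumes the arithmetic-mean normalization, I(U;V) / ((H(U) + H(V)) / 2); the abstract does not say which normalization the authors used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information I(A;B) between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum p(x,y) * log(p(x,y) / (p(x) p(y))) over co-occurring label pairs.
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0  # both partitions are a single cluster: treat as identical
    return mutual_information(true_labels, pred_labels) / ((h_true + h_pred) / 2)


# Usage: cluster IDs are arbitrary, so a pure relabeling still scores 1.0,
# while a clustering that merges two true groups scores strictly between 0 and 1.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 1]))
```

Because NMI is invariant to how cluster IDs are numbered, it suits evaluating an unsupervised method such as instance clustering, where predicted cluster IDs carry no inherent meaning.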