Understanding and Handling Alert Storm for Online Service Systems
Nengwen Zhao, Junjie Chen, Xiao Peng, Honglin Wang, Xinya Wu, Yuanzong Zhang, Zikai Chen, Xiangzhong Zheng, Xiaohui Nie, Gang Wang, Yong Wu, Fang Zhou, Wenchi Zhang, Kaixin Sui, Dan Pei
DOI: https://doi.org/10.1145/3377812.3390809
2020-01-01
Abstract: Alerts are a key data source in monitoring systems for online service systems; they record anomalies in service components and report them to engineers. A service failure is typically accompanied by a large number of alerts, a phenomenon called an alert storm. An alert storm makes failure diagnosis very challenging, because manually investigating such an overwhelming number of alerts is time-consuming and tedious for engineers. To help understand alert storms, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on these findings, we propose a novel approach to handling alert storms. Specifically, the approach comprises alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). In addition, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers useful alerts accurately.
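
The two headline figures in the abstract (an F1-score above 0.9 and an alert reduction of more than 98%) follow standard definitions, which can be sketched as below. This is a minimal illustration only: the function names `f1_score` and `reduction_rate` and all input counts are hypothetical and are not taken from the paper or its dataset.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for storm detection.

    tp: storms correctly detected, fp: false alarms, fn: missed storms.
    Returns 0.0 when there are no true positives (F1 is undefined there).
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def reduction_rate(total_alerts: int, recommended_alerts: int) -> float:
    """Fraction of alerts an engineer no longer has to examine."""
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    return 1 - recommended_alerts / total_alerts


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(round(f1_score(tp=19, fp=2, fn=1), 3))
print(round(reduction_rate(total_alerts=1000, recommended_alerts=15), 3))
```

With these made-up counts, 19 detected storms with 2 false alarms and 1 miss clear the 0.9 F1 threshold, and recommending 15 alerts out of 1000 corresponds to a reduction above 98%.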