Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems

Canhua Wu, Nengwen Zhao, Lixin Wang, Xiaoqin Yang, Shining Li, Ming Zhang, Xing Jin, Xidao Wen, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei
DOI: https://doi.org/10.1109/issre52982.2021.00022
2021-01-01
Abstract: Incidents in online service systems can cause poor user experience and tremendous economic loss. To reduce the impact of incidents and guarantee service reliability, it is critical to identify root-cause metrics that give engineers clues for incident diagnosis. However, this is a challenging task because of the complicated dependencies among metrics and the huge volume of metrics in large-scale systems. Existing approaches are based on either anomaly detection or correlation analysis, and they perform poorly in terms of accuracy or efficiency. To better understand the problem of root-cause metric identification, we conduct a preliminary study based on real-world data analysis and interactions with engineers. The key observation is that root-cause metrics should satisfy two requirements. First, the metric is expected to behave abnormally during the incident. Second, the anomaly pattern should be consistent with the metric's physical meaning and with engineers' needs. Motivated by these findings, we propose an effective approach named PatternMatcher to identify root-cause metrics accurately. Specifically, PatternMatcher consists of three steps: coarse-grained anomaly detection, which filters out normal metrics; anomaly pattern classification, which filters out unimportant anomaly patterns; and root-cause metric ranking. An extensive study on four real-world datasets, comprising 113 incident cases from a large commercial bank, demonstrates that PatternMatcher outperforms all baseline approaches, achieving an average top-3 accuracy of 0.91. Moreover, we have deployed PatternMatcher in practice and share some successful cases from the real deployment.
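The three-step structure described in the abstract (filter out normal metrics, filter out unimportant anomaly patterns, rank what remains) can be illustrated with a minimal Python sketch. This is not the PatternMatcher implementation: the abstract does not specify the detector, the classifier, or the ranking criterion, so the z-score detector, the toy pattern rule, the pattern names `level_shift` and `fluctuation`, the threshold of 3.0, and all function names below are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def anomaly_score(history, incident):
    """Largest absolute z-score of the incident window against the history window.

    Assumption: a simple z-score stands in for coarse-grained anomaly detection.
    """
    mu = mean(history)
    sigma = stdev(history) or 1e-9  # guard against a perfectly flat history
    return max(abs(x - mu) / sigma for x in incident)


def classify_pattern(history, incident):
    """Toy rule standing in for an anomaly pattern classifier (hypothetical labels)."""
    mu = mean(history)
    above = [x > mu for x in incident]
    if all(above) or not any(above):
        return "level_shift"  # the whole incident window sits on one side of the mean
    return "fluctuation"      # the incident window oscillates around the mean


def rank_root_cause_metrics(metrics, threshold=3.0, wanted=("level_shift",)):
    """Return candidate metric names, most anomalous first."""
    candidates = []
    for name, (history, incident) in metrics.items():
        score = anomaly_score(history, incident)
        if score < threshold:
            continue  # step 1: drop metrics that look normal
        if classify_pattern(history, incident) not in wanted:
            continue  # step 2: drop anomaly patterns considered unimportant
        candidates.append((name, score))
    # step 3: rank the remaining metrics by anomaly score
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [name for name, _ in ranked]


# Synthetic example data: (history window, incident window) per metric.
metrics = {
    "cpu": ([10, 11, 9, 10, 10, 11, 9, 10], [30, 31, 29, 30]),
    "mem": ([50, 51, 49, 50, 50, 51, 49, 50], [50, 51, 49, 50]),
    "latency": ([100, 102, 98, 100, 101, 99, 100, 100], [120, 80, 125, 75]),
    "errors": ([1, 2, 1, 2, 1, 2, 1, 2], [9, 10, 9, 10]),
}
print(rank_root_cause_metrics(metrics))  # ['cpu', 'errors']
```

In this synthetic data, `mem` is dropped in the first step because it stays within its normal range, `latency` is dropped in the second step because its pattern is not in the assumed set of wanted patterns, and the two remaining metrics are ordered by score. A top-3 accuracy such as the one reported would then count an incident as a hit when the true root-cause metric appears among the first three names returned.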