CMDiagnostor: an Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data

Qingyang Yu, Changhua Pei, Bowen Hao, Mingjie Li, Zeyan Li, Shenglin Zhang, Xianglin Lu, Rui Wang, Jiaqi Li, Zhenyu Wu, Dan Pei
DOI: https://doi.org/10.1145/3543507.3583302
2023-01-01
Abstract: The availability of online services is vital because of its strong relevance to revenue and user experience. To ensure online services' availability, quickly localizing the root causes of system failures is crucial. Given the high resource consumption of traces, existing approaches widely use call metric data to construct call graphs in practice. However, the correspondence between upstream and downstream calls may be ambiguous, which can cause unexpected edges to be explored in the constructed call graph. Conducting root cause localization on such a graph may lead to misjudgments of the real root causes. To the best of our knowledge, we are the first to investigate this ambiguity, which is overlooked in the existing literature. Inspired by the law of large numbers and the Markov properties of network traffic, we propose a regression-based method (named AmSitor) to address this problem effectively. Based on AmSitor, we propose CMDiagnostor, an ambiguity-aware root cause localization approach based on call metric data, which contains metric anomaly detection, ambiguity-free call graph construction, root cause exploration, and candidate root cause ranking modules. Comprehensive experimental evaluations conducted on real-world datasets show that CMDiagnostor can outperform the state-of-the-art approaches by 14% on the top-5 hit rate. Moreover, AmSitor can also be applied separately to existing baseline approaches to further improve their performance. The source code is released at https://github.com/NetManAIOps/CMDiagnostor.
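To make the ambiguity concrete, the following is a minimal illustrative sketch and not the authors' AmSitor implementation. It assumes a hypothetical service B that receives calls from two upstream callers (A1 and A2) and issues calls to a downstream service C. Call metric data only gives per-minute call counts, so it is unclear which upstream call actually triggers the B-to-C call. Regressing the downstream counts on the upstream counts is one way to estimate that correspondence. All names, the sample counts, and the 0.5 threshold are assumptions made for illustration.

```python
def fit_two_upstreams(x1, x2, y):
    """Least-squares fit of y ~ w1*x1 + w2*x2 (no intercept),
    solved directly from the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        # Collinear upstream series cannot be told apart by regression.
        raise ValueError("upstream series are collinear")
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


def keep_edges(weights, names, threshold=0.5):
    """Keep only upstream callers whose estimated weight exceeds the
    (hypothetical) threshold; the rest are treated as ambiguous edges."""
    return [n for n, w in zip(names, weights) if w > threshold]


# Hypothetical per-minute call counts.
a1_to_b = [100, 120, 80, 150, 90]   # calls from A1 into B
a2_to_b = [30, 10, 60, 20, 45]      # calls from A2 into B
b_to_c = [101, 119, 80, 151, 89]    # calls from B into C (tracks A1 closely)

w1, w2 = fit_two_upstreams(a1_to_b, a2_to_b, b_to_c)
kept = keep_edges((w1, w2), ("A1->B", "A2->B"))
print(kept)
```

In this toy data the B-to-C counts follow the A1-to-B counts almost one to one, so the fitted weight for A1 is close to 1 and the weight for A2 is close to 0. Only the A1 path would then be followed through B to C during root cause exploration.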