Causality Enhanced Graph Representation Learning for Alert-Based Root Cause Analysis

Zhaoyang Yu, Qianyu Ouyang, Changhua Pei, Xin Wang, Wenxiao Chen, Liangfei Su, Huai Jiang, Xuanrun Wang, Jianhui Li, Dan Pei
DOI: https://doi.org/10.1109/ccgrid59990.2024.00018
2024-01-01
Abstract: Accurate and efficient root cause identification in online service systems is critical for service stability and user experience. When a system failure occurs, numerous alerts are generated, but existing methods fail to effectively integrate this multi-modal data to pinpoint the root causes. Moreover, most existing approaches are inefficient for large-scale online services because they rely heavily on handcrafted rules and domain expertise. This paper introduces AlertRCA, an algorithm for Root Cause Analysis (RCA) based on alert events. It uses a pre-trained Alert2Vec module to encode multi-modal alert information into vectors, and implements an RCA-oriented causality prediction graph attention network (CPGAT) to automatically gauge causal relationships between alerts. Further, we devise a novel dispersing and aggregating graph neural network (DAGNN) to identify root causes. Experiments on a real-world dataset collected from a top-tier e-commerce company show AlertRCA’s superior performance, achieving 83.9% top-1 and 96.8% top-3 accuracy on average. Our code is available at https://github.com/NetManAIOps/AlertRCA.
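The top-1 and top-3 figures quoted in the abstract are top-k accuracies: the fraction of failure cases in which the true root cause appears among the k highest-ranked candidates. The following is a minimal sketch of that metric, not code from the AlertRCA repository. The function name `top_k_accuracy`, the alert labels, and the sample rankings are all hypothetical, and the sketch assumes one ground-truth root cause per failure case.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    true_roots: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true root cause is among the top-k candidates.

    Assumption: each failure case has exactly one labeled root cause, and
    each ranking lists candidates from most to least likely.
    """
    if len(ranked_candidates) != len(true_roots):
        raise ValueError("need one ranking per failure case")
    if not true_roots:
        raise ValueError("need at least one failure case")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_roots)
        if truth in ranking[:k]
    )
    return hits / len(true_roots)


# Hypothetical rankings for four failure cases (illustrative data only).
rankings = [
    ["db-timeout", "api-5xx", "cpu-high"],
    ["cpu-high", "disk-full", "api-5xx"],
    ["api-5xx", "db-timeout", "net-loss"],
    ["net-loss", "cpu-high", "disk-full"],
]
truths = ["db-timeout", "disk-full", "net-loss", "cpu-high"]

print(top_k_accuracy(rankings, truths, k=1))  # 0.25
print(top_k_accuracy(rankings, truths, k=3))  # 1.0
```

Top-3 accuracy is never lower than top-1 accuracy on the same rankings, since every top-1 hit is also a top-3 hit.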