Disentangled Causal Graph Learning for Online Unsupervised Root Cause Analysis

Dongjie Wang, Zhengzhang Chen, Yanjie Fu, Yanchi Liu, Haifeng Chen
2023-06-03
Abstract: The task of root cause analysis (RCA) is to identify the root causes of system faults/failures by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure recovery and mitigate system damages or financial losses. However, previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process, a significant amount of time and data to train a robust model, and then being retrained from scratch for a new system fault. In this paper, we propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model. CORAL consists of Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component aims to detect system state transitions automatically and in near-real-time. To achieve this, we develop an online trigger point detection approach based on multivariate singular spectrum analysis and cumulative sum statistics. To efficiently update the RCA model, we propose an incremental disentangled causal graph learning approach to decouple the state-invariant and state-dependent information. After that, CORAL applies a random walk with restarts to the updated causal graph to accurately identify root causes. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets with case studies demonstrate the effectiveness and superiority of the proposed framework.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses root cause analysis (RCA) for system failures. Most existing methods are offline: the RCA process must be triggered manually, a large amount of historical data is needed to train a robust model, and the model has to be retrained from scratch whenever a new failure occurs. This is time-consuming and prevents the damage or loss caused by a failure from being mitigated promptly. To solve these issues, the paper proposes CORAL, an online framework that automatically triggers the RCA process on the system monitoring data stream and incrementally updates the RCA model. CORAL consists of three parts:

1. **Trigger Point Detection**: Automatically detects the points at which the system state changes, in near-real-time, using multivariate singular spectrum analysis and cumulative sum statistics.
2. **Incremental Disentangled Causal Graph Learning**: Learns the causal graph incrementally, decoupling state-invariant information from state-dependent information.
3. **Network Propagation-based Root Cause Localization**: Applies a random walk with restarts to the updated causal graph to locate the root causes.

The goal of CORAL is to start the RCA process automatically when a system failure occurs and to update the initial causal graph with each incoming batch of data, thereby efficiently identifying the nodes most related to the system's Key Performance Indicators (KPIs). When the learned causal graph and the generated root cause list converge, the online process terminates and the system operator receives the final root cause list for system recovery.
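
The root cause localization step described above can be illustrated with a minimal, generic random walk with restart in Python. This is a sketch under stated assumptions, not CORAL's implementation: the four-node graph, the edge weights, the restart probability of 0.3, and the function name `random_walk_with_restart` are all hypothetical. It assumes the walker starts at the KPI node and follows causal edges backwards (from effect to cause), so nodes that are visited more often are ranked as more likely root causes.

```python
import numpy as np


def random_walk_with_restart(adj, start, restart_prob=0.3, tol=1e-10, max_iter=1000):
    """Return the visiting probabilities of a walker that restarts at `start`.

    adj[i, j] > 0 means the walker can step from node i to node j.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[start] = 1.0

    # Row-normalize the adjacency matrix into transition probabilities.
    # Nodes without outgoing edges send the walker back to the start node.
    row_sums = adj.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, adj / safe_sums, restart)

    # Power iteration until the distribution stops changing.
    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1 - restart_prob) * (p @ transition) + restart_prob * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Hypothetical causal graph: causal[i, j] is the strength of "i causes j".
# Node 0 is the KPI; nodes 1-3 are monitored system entities.
names = ["kpi", "service_a", "service_b", "database"]
causal = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0],   # service_a -> kpi
    [0.3, 0.0, 0.0, 0.0],   # service_b -> kpi
    [0.0, 0.9, 0.2, 0.0],   # database -> service_a, service_b
])

# Walk from effects back to causes, so use the transposed graph.
scores = random_walk_with_restart(causal.T, start=0)

# Rank candidate root causes (every node except the KPI) by score.
ranking = [i for i in np.argsort(-scores) if i != 0]
for i in ranking:
    print(f"{names[i]}: {scores[i]:.4f}")
```

In this toy graph the strongly connected `service_a` ranks above the weakly connected `service_b`. The restart probability controls how far the walker drifts from the KPI before jumping back, so it is a tuning choice rather than a fixed value.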