Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization

Dongjie Wang, Zhengzhang Chen, Jingchao Ni, Liang Tong, Zheng Wang, Yanjie Fu, Haifeng Chen
DOI: https://doi.org/10.48550/arXiv.2302.01987
2023-02-04
Abstract: In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery and Individual Causal Discovery. The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walks with restarts to model the network propagation of a system fault. The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets with case studies demonstrate the effectiveness and superiority of the proposed framework.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses **root cause localization in complex systems**. Existing methods mainly construct a single, isolated causal network and ignore the interdependent structure of many real-world systems, in which multiple networks are connected through cross-network links. In such interdependent networks, fault effects can propagate across levels and between system entities in different networks, so methods that model only one network produce sub-optimal root cause analysis results.

### Core problems of the paper

1. **Limitations of existing methods**:
   - Existing methods focus on constructing a single, isolated causal network, which does not reflect the complexity and interdependence of real-world systems.
   - Because the interdependent relationships between multiple networks are ignored, root cause localization becomes inaccurate.
2. **Actual requirements**:
   - Faults in complex systems (such as microservice systems and industrial control systems) degrade user experience and cause economic losses, so efficient and accurate root cause analysis is needed to restore services quickly and reduce losses.

### Proposed solutions

To solve these problems, the paper proposes a framework named **REASON** that automatically discovers causal relationships within and across networks and uses them to locate the root cause of a fault. REASON has two main components:

1. **Topological Causal Discovery (TCD)**:
   - It models fault propagation in order to trace a fault back to its root cause.
   - It uses hierarchical graph neural networks to construct interdependent causal networks, capturing non-linear causal relationships both within and across networks.
   - It applies random walks with restarts on the learned networks to model how a system fault propagates.
2. **Individual Causal Discovery (ICD)**:
   - It focuses on capturing abrupt change patterns of individual system entities.
   - It analyzes the time-series metric data of each entity and estimates its likelihood of being the root cause based on Extreme Value Theory.

### Integration and output

Finally, REASON combines the topological causal score and the individual causal score and selects the top \( K \) system entities with the highest scores as the root causes.

### Summary

The goal of the paper is to accurately identify the root causes of system faults by learning the causal relationships of multi-level interconnected systems, thereby improving the stability and robustness of complex systems.
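
The propagation and ranking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the four-entity graph, the edge weights, the restart probability, the individual scores (which in REASON would come from the Extreme Value Theory step), and the weighted-sum combination with `gamma` are all hypothetical values and choices made for illustration. The walk starts at the entity where the fault was observed and follows edges from effect to candidate cause.

```python
def random_walk_with_restart(adj, start, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visit probabilities of a random walk with restart.

    adj[i][j] is the (assumed) weight of stepping from node i to node j.
    """
    n = len(adj)
    # Row-normalise weights into transition probabilities. A node without
    # outgoing edges sends the walker back to the start node.
    trans = []
    for row in adj:
        total = sum(row)
        if total > 0:
            trans.append([w / total for w in row])
        else:
            trans.append([1.0 if j == start else 0.0 for j in range(n)])

    p = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * trans[i][j]
        nxt[start] += restart  # with probability `restart`, jump back to start
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


def combine_scores(topological, individual, gamma=0.5):
    """Max-normalise both score lists and mix them with weight gamma (assumed)."""
    def normalise(xs):
        hi = max(xs)
        return [x / hi if hi > 0 else 0.0 for x in xs]

    t, s = normalise(topological), normalise(individual)
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(t, s)]


def top_k(scores, names, k):
    """Return the names of the k highest-scoring entities."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical system: the fault is observed at "frontend".
names = ["frontend", "cart", "db", "cache"]
# Edges point from an effect to its candidate causes (reversed causal edges).
adj = [
    [0.0, 0.7, 0.0, 0.3],  # frontend <- cart, cache
    [0.0, 0.0, 1.0, 0.0],  # cart <- db
    [0.0, 0.0, 0.0, 0.0],  # db has no upstream cause
    [0.0, 0.0, 0.0, 0.0],  # cache has no upstream cause
]
visit_prob = random_walk_with_restart(adj, start=0)

# The entity where the symptom was observed is not treated as a candidate here.
topological = list(visit_prob)
topological[0] = 0.0

# Placeholder individual scores; made-up numbers, not EVT output.
individual = [0.1, 0.2, 0.9, 0.1]

combined = combine_scores(topological, individual, gamma=0.5)
print(top_k(combined, names, k=2))
```

In this toy setup the walk alone favors `cart`, the direct upstream neighbor of the symptom, while the individual score favors `db`. Mixing the two is what lets an entity with a strong abrupt-change signal outrank one that is merely close to the symptom in the graph.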