Root Cause Analysis In Microservice Using Neural Granger Causal Discovery

Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng
2024-02-02
Abstract: In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintenance, and flexibility. However, it becomes challenging for site reliability engineers (SREs) to pinpoint the root cause due to the complex relationships in microservices when facing system malfunctions. Previous research employed structured learning methods (e.g., PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, they ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, in cases where there is a sudden spike in CPU utilization, it can lead to an increase in latency for other microservices. However, in this scenario, the anomaly in CPU utilization occurs before the latency increase, rather than simultaneously. As a result, the PC-algorithm fails to capture such characteristics. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms the state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the problem of accurately locating the root cause of a system failure in a microservice architecture. Specifically:

1. **Background problems**:
   - As microservice architectures become widespread, the complex relationships between services make it difficult for site reliability engineers (SREs) to locate the root cause of a failure.
   - Traditional structured learning methods (such as the PC algorithm) can establish causal relationships and derive root causes, but they ignore the temporal order of time-series data and so cannot exploit the information carried by temporal relationships.
2. **Specific challenges**:
   - When one service becomes abnormal, the dependencies between microservices can trigger a chain reaction that causes problems in other services and ultimately leads to a system failure.
   - Existing methods (such as the PC algorithm) fail to capture temporal dependence in time-series data. For example, a sudden spike in CPU utilization can increase the latency of other microservices, but the CPU anomaly occurs before the latency increase rather than at the same time, and existing methods do not handle this lag.
3. **Solutions**:
   - The paper proposes RUN (Root Cause Analysis Using Neural Granger Causal Discovery), which combines neural Granger causal discovery with contrastive learning.
   - RUN enhances the backbone encoder by integrating contextual information from the time series and uses a time-series forecasting model to perform neural Granger causal discovery.
   - RUN then applies PageRank with a personalization vector to efficiently recommend the top-k root causes.
4. **Innovations**:
   - A self-supervised neural Granger causal discovery framework, RUN, which captures contextual information in time series and uses a time-series forecasting model to construct a causal graph over multivariate time series.
   - A time-series contrastive learning method that treats only different contextual instances of the same timestamp as positive pairs, which avoids introducing incorrect negative pairs.
   - Experiments on synthetic and real-world microservice datasets show that RUN significantly outperforms existing root cause analysis methods.

In summary, the paper aims to develop a more accurate and effective root cause analysis method for microservices. By combining neural Granger causal discovery with contrastive learning, it exploits the temporal dependence in time-series data to better locate the root cause of system failures.
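The two stages described above, Granger causal discovery over lagged metrics followed by PageRank with a personalization vector, can be illustrated with a small sketch. This is not the RUN implementation: it swaps the neural forecasting model and contrastive encoder for a plain linear pairwise Granger test, and the metric names, the synthetic data, the threshold, the anomaly scores used as the personalization vector, and the choice to walk edges from effect back to cause are all assumptions made for illustration.

```python
import numpy as np

def lagged(x, max_lag):
    """Columns x[t-1], ..., x[t-max_lag] for t = max_lag .. len(x)-1."""
    T = len(x)
    return np.column_stack([x[max_lag - k: T - k] for k in range(1, max_lag + 1)])

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return float(resid @ resid)

def granger_score(cause, effect, max_lag=3):
    """Log ratio of forecast error without vs. with the candidate cause's past.

    A large value means past values of `cause` improve the forecast of
    `effect` beyond what the effect's own past already provides.
    """
    y = effect[max_lag:]
    own = lagged(effect, max_lag)
    both = np.column_stack([own, lagged(cause, max_lag)])
    return np.log(rss(own, y) / rss(both, y))

def granger_graph(series, max_lag=3, threshold=0.1):
    """adj[i, j] = 1 if series i Granger-causes series j (threshold is an assumption)."""
    n = len(series)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and granger_score(series[i], series[j], max_lag) > threshold:
                adj[i, j] = 1
    return adj

def personalized_pagerank(adj, personalization, alpha=0.85, iters=100):
    """Power iteration on the reversed graph so rank flows from effects to causes."""
    p = personalization / personalization.sum()
    W = adj.T.astype(float)                      # W[j, i] = 1 if i causes j
    row_sums = W.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)
    # Nodes with no known cause restart according to the personalization vector.
    P = np.where(row_sums > 0, W / safe, p)
    r = p.copy()
    for _ in range(iters):
        r = alpha * (r @ P) + (1 - alpha) * p
    return r

# Synthetic metrics (hypothetical): a CPU spike precedes latency in service A
# by two steps, which in turn precedes latency in service B by one step.
rng = np.random.default_rng(0)
T = 500
cpu = rng.normal(size=T)
latency_a = np.zeros(T)
latency_b = np.zeros(T)
for t in range(2, T):
    latency_a[t] = 0.8 * cpu[t - 2] + 0.1 * rng.normal()
    latency_b[t] = 0.8 * latency_a[t - 1] + 0.1 * rng.normal()

names = ["cpu", "latency_a", "latency_b"]
adj = granger_graph([cpu, latency_a, latency_b])

# Hypothetical anomaly scores: the downstream symptom looks most anomalous.
anomaly = np.array([0.2, 0.3, 0.5])
scores = personalized_pagerank(adj, anomaly)
ranking = [names[i] for i in np.argsort(-scores)]
print(adj)
print(ranking)
```

On this synthetic data the lagged test should recover the edges from `cpu` to `latency_a` and from `latency_a` to `latency_b`, and the ranking should place `cpu` first even though the personalization vector weights the downstream symptom most heavily. Because the test is pairwise, it may also report an indirect edge from `cpu` to `latency_b`, which is one of the limitations a joint forecasting model is meant to address.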