Abstract: In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structure learning methods (e.g., the PC algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, a sudden spike in CPU utilization can lead to an increase in latency for other microservices. In this scenario, the anomaly in CPU utilization occurs before the latency increase, rather than simultaneously, so the PC algorithm fails to capture it. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms the state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at
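The lagged CPU-to-latency effect described in the abstract can be illustrated with a classical linear Granger-style check. This is only a minimal sketch under stated assumptions: it uses ordinary least squares rather than the neural forecasting model RUN relies on, and the function names (`lagged_matrix`, `granger_gain`), the synthetic `cpu` and `latency` series, and the two-step delay are all hypothetical choices made for illustration.

```python
import numpy as np

def lagged_matrix(series, max_lag):
    """Columns are series[t-1], ..., series[t-max_lag] for t = max_lag..T-1."""
    T = len(series)
    return np.column_stack(
        [series[max_lag - k : T - k] for k in range(1, max_lag + 1)]
    )

def granger_gain(cause, effect, max_lag=3):
    """Relative drop in residual sum of squares when past values of `cause`
    are added to an autoregression of `effect` (a linear Granger-style score,
    used here as a stand-in for a neural forecaster)."""
    y = effect[max_lag:]
    ones = np.ones((len(y), 1))
    restricted = np.hstack([ones, lagged_matrix(effect, max_lag)])
    full = np.hstack([restricted, lagged_matrix(cause, max_lag)])

    def rss(X):
        coef = np.linalg.lstsq(X, y, rcond=None)[0]
        return float(np.sum((y - X @ coef) ** 2))

    rss_restricted, rss_full = rss(restricted), rss(full)
    return (rss_restricted - rss_full) / rss_restricted

# Synthetic data (assumption): latency follows CPU with a two-step delay.
rng = np.random.default_rng(0)
T = 500
cpu = rng.normal(size=T)
cpu[200:210] += 5.0                      # hypothetical CPU spike
latency = np.zeros(T)
latency[2:] = 0.8 * cpu[:-2]             # effect appears two steps later
latency += 0.1 * rng.normal(size=T)

forward = granger_gain(cpu, latency)     # does past CPU help predict latency?
backward = granger_gain(latency, cpu)    # does past latency help predict CPU?
print(f"cpu -> latency gain: {forward:.3f}")
print(f"latency -> cpu gain: {backward:.3f}")
```

A large gain in one direction and a small gain in the other reflects the temporal ordering that a method ignoring lags would miss.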

What problem does this paper attempt to address?

This paper addresses the problem of accurately locating the root cause of system failures in a microservice architecture. Specifically:

1. **Background problems**:
   - As the microservice architecture becomes widely adopted, its complex relationships make it difficult for site reliability engineers (SREs) to locate the root cause when a system fails.
   - Traditional structure learning methods (such as the PC algorithm) can establish causal relationships and deduce the root cause, but they ignore the temporal order in time-series data and cannot fully use the information in temporal relationships.
2. **Specific challenges**:
   - When a service becomes abnormal, the dependencies between microservices may trigger a chain reaction that causes problems in other services and ultimately leads to system failure.
   - Existing methods (such as the PC algorithm) fail to capture temporal dependence in time-series data. For example, a sudden increase in CPU utilization leads to an increase in the latency of other microservices, but the two anomalies occur at different time points, and existing methods cannot handle this situation effectively.
3. **Solutions**:
   - The paper proposes a new method, RUN (Root Cause Analysis Using Neural Granger Causal Discovery), which uses neural Granger causal discovery and contrastive learning to solve these problems.
   - RUN enhances the backbone encoder to integrate contextual information from the time series and uses a time-series forecasting model for neural Granger causal discovery.
   - In addition, RUN combines the PageRank algorithm with a personalization vector to efficiently recommend the top-k root causes.
4. **Innovations**:
   - The paper proposes RUN, a self-supervised neural Granger causal discovery framework that captures contextual information in the time series and uses a time-series forecasting model to construct a causal graph between multivariate time series.
   - It introduces a time-series contrastive learning method that treats only different context instances at the same timestamp as positive sample pairs, which avoids introducing false negative sample pairs.
   - Experiments on synthetic datasets and real-world microservice datasets show that RUN significantly outperforms existing root cause analysis methods.

In summary, this paper aims to develop a more accurate and effective root cause analysis method for microservices. By combining neural Granger causal discovery and contrastive learning, it makes full use of the temporal dependence in time-series data to better locate the root cause of system failures.
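The ranking step mentioned above (PageRank with a personalization vector over a causal graph) can be sketched with a small power iteration. This is a hedged illustration, not RUN's implementation: the four-metric graph, the metric names, the anomaly scores used as the personalization vector, and the choice to walk from effects back toward causes are all assumptions made for this example.

```python
import numpy as np

def personalized_pagerank(adj, personalization, damping=0.85,
                          tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with a personalization vector.
    adj[i, j] > 0 means a walker at node i may step to node j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    p = np.asarray(personalization, dtype=float)
    p = p / p.sum()
    row_sums = adj.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Nodes without outgoing edges (dangling) restart according to p.
    transition = np.where(row_sums > 0, adj / safe_sums, p)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (rank @ transition) + (1.0 - damping) * p
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical causal chain: cpu_A -> latency_A -> latency_B -> latency_frontend
metrics = ["cpu_A", "latency_A", "latency_B", "latency_frontend"]
causal = np.zeros((4, 4))
causal[0, 1] = causal[1, 2] = causal[2, 3] = 1.0   # cause -> effect

# Assumption: the walker moves from effect to cause, so edges are reversed.
walk = causal.T

# Assumption: hypothetical per-metric anomaly scores as the personalization vector.
anomaly_scores = [0.4, 0.3, 0.2, 0.1]

rank = personalized_pagerank(walk, anomaly_scores)
top_k = [metrics[i] for i in np.argsort(-rank)[:2]]
print(top_k)
```

Reversing the edges lets score accumulate at upstream causes, while the personalization vector biases restarts toward metrics that already look anomalous.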

Root Cause Analysis In Microservice Using Neural Granger Causal Discovery
