Abstract: In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structured learning methods (e.g., the PC algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, a sudden spike in CPU utilization can increase latency in other microservices, but the CPU anomaly occurs before the latency increase rather than simultaneously, and the PC algorithm fails to capture such lagged effects. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms the state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at
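
As a rough illustration of the two ideas the abstract combines, lagged (Granger-style) causal discovery followed by PageRank with a personalization vector, the sketch below may help. It is not the RUN implementation: a plain linear Granger check (lagged least squares) stands in for the neural forecasting model and the contrastive encoder, which are not reproduced here. The synthetic metrics, the function names, the 0.2 edge threshold, the uniform personalization vector, and the choice to walk edges from effect to cause are all assumptions made for this example.

```python
import numpy as np


def lagged_design(series, lag):
    """Row t holds the values at t-1, ..., t-lag for every column of `series`."""
    n = series.shape[0]
    return np.hstack([series[lag - k: n - k] for k in range(1, lag + 1)])


def granger_gain(cause, effect, lag=2):
    """Relative drop in squared error when past `cause` values are added
    to a linear autoregression of `effect` (0 means no improvement)."""
    target = effect[lag:]
    own = lagged_design(effect[:, None], lag)
    both = np.hstack([own, lagged_design(cause[:, None], lag)])

    def rss(features):
        design = np.hstack([np.ones((features.shape[0], 1)), features])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    restricted, full = rss(own), rss(both)
    if restricted <= 0.0:
        return 0.0
    return max(0.0, 1.0 - full / restricted)


def personalized_pagerank(weights, personalization, damping=0.85, iters=100):
    """Power iteration; weights[i, j] is the weight of the edge i -> j."""
    p = personalization / personalization.sum()
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Nodes without outgoing edges jump according to the personalization vector.
    transition = np.where(row_sums > 0, weights / safe_sums, p)
    rank = p.copy()
    for _ in range(iters):
        rank = damping * (rank @ transition) + (1 - damping) * p
    return rank


# Synthetic metrics (assumed): a CPU spike raises latency one step later,
# while memory is unrelated noise.
rng = np.random.default_rng(0)
n = 400
cpu = rng.normal(size=n)
cpu[200:220] += 5.0
latency = np.zeros(n)
latency[1:] = 0.8 * cpu[:-1]
latency += 0.1 * rng.normal(size=n)
memory = rng.normal(size=n)

names = ["cpu", "latency", "memory"]
series = [cpu, latency, memory]

# gain[i, j] scores how much the past of metric i helps predict metric j.
gain = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            gain[i, j] = granger_gain(series[i], series[j])

forward, reverse = gain[0, 1], gain[1, 0]

# Keep strong edges only and reverse them, so the walk moves from an
# affected metric back toward its likely cause (an assumed convention).
threshold = 0.2
weights = np.where(gain.T >= threshold, gain.T, 0.0)
rank = personalized_pagerank(weights, np.ones(3))
ranking = [names[k] for k in np.argsort(-rank)]
print(ranking)
```

Because the latency series depends on the previous CPU value, the lagged regression attributes a much larger gain to the CPU-to-latency direction than to the reverse, which is the kind of temporal ordering a test on simultaneous values cannot see. In practice the personalization vector would likely encode prior suspicion about each metric (for example an anomaly score) instead of being uniform.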
