Abstract: To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure.

What problem does this paper attempt to address?

The paper addresses the problem of quickly and accurately identifying the root cause of failures in distributed services. It proposes a method called LogRCA, which aims to identify the smallest set of log lines, out of a large volume of log data, that together describe the root cause of a failure.

Existing log anomaly detection methods can identify log events that indicate the cause of a system failure. In complex service architectures, however, failures often propagate widely, so these methods flag a large number of anomalies. This makes it difficult for users to quickly pinpoint the actual root cause. To address this, LogRCA adopts a semi-supervised learning approach that copes with rare and unknown errors and can process noisy data.

The main contributions of LogRCA are:

1. A method to identify the set of log lines that describe the root cause of a system failure. The method uses semi-supervised learning and a Transformer model trained with a custom objective function, and it can handle very noisy data.
2. A method to improve performance on rare or unknown failures by balancing the training data. It estimates the number of distinct root causes and how often each occurs in the training dataset through automatic clustering, and uses these estimates to balance the data.
3. An evaluation of LogRCA on a large-scale production dataset containing 44.3 million logs, with a comparison against baseline methods based on deep learning and statistical analysis. The experimental results show that LogRCA outperforms the baselines in consistency and accuracy when detecting candidate root causes.

In summary, LogRCA aims to simplify fault diagnosis by reducing the number of logs that users need to review, helping developers and operations teams understand and resolve system failures more quickly.
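To make the data-balancing idea in the second contribution more concrete, the following is a minimal sketch, not the authors' implementation. The summary does not say which clustering algorithm or weighting scheme LogRCA uses, so everything here is an assumption made for illustration: the greedy Jaccard-based clustering, the `threshold` value, the helper names `greedy_cluster` and `balancing_weights`, and the toy failure data are all hypothetical. The sketch groups failures with similar log tokens into clusters (a stand-in for estimated root causes) and then assigns each failure a sampling weight so that every cluster contributes equally to training.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(failures, threshold=0.5):
    """Assign each failure to the first cluster whose representative is
    similar enough, otherwise open a new cluster (hypothetical stand-in
    for the automatic clustering step mentioned in the summary)."""
    representatives = []
    labels = []
    for tokens in failures:
        for idx, rep in enumerate(representatives):
            if jaccard(tokens, rep) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster matched: this failure starts a new one.
            representatives.append(tokens)
            labels.append(len(representatives) - 1)
    return labels


def balancing_weights(labels):
    """Per-sample weights such that each cluster has the same total weight
    and all weights sum to 1."""
    counts = Counter(labels)
    n_clusters = len(counts)
    return [1.0 / (n_clusters * counts[label]) for label in labels]


# Toy data (invented): each failure is the set of tokens seen in its logs.
failures = [
    {"disk", "full"},
    {"disk", "full", "write"},
    {"disk", "full"},
    {"auth", "timeout"},
]
labels = greedy_cluster(failures)
weights = balancing_weights(labels)
print(labels)
print(weights)
```

In this toy run the three disk-related failures fall into one cluster and the single authentication failure into another, so the rare failure receives a larger sampling weight than each of the frequent ones. Such weights could, for instance, drive a weighted sampler during training, though whether LogRCA balances by resampling or by another mechanism is not stated in the summary.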

LogRCA: Log-based Root Cause Analysis for Distributed Services
