LogRCA: Log-based Root Cause Analysis for Distributed Services

Thorsten Wittkopp, Philipp Wiesner, Odej Kao
2024-05-22
Abstract: To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of quickly and accurately identifying the root cause of system failures in distributed services. It proposes a method called LogRCA, which aims to identify the smallest set of log lines, out of a large volume of log data, that together describe the root cause of a failure. Existing log anomaly detection methods can identify log events that indicate why a system failed, but in complex service architectures failures often propagate widely, so a large number of anomalies are detected. This makes it difficult for users to quickly pinpoint the actual root cause. To address this, LogRCA adopts a semi-supervised learning approach that handles rare and unknown errors and can process noisy data.

The main contributions of LogRCA include:

1. Proposing a method to identify the set of log lines that describe the root cause of a system failure. The method uses semi-supervised learning and a Transformer model with a custom objective function, and is capable of handling very noisy data.
2. Introducing a method to improve performance on rare or unknown failures by balancing the training data. The method uses automatic clustering to estimate the number of different root causes in the training dataset and how often each occurs, and balances the training data accordingly.
3. Evaluating LogRCA on a large-scale production dataset containing 44.3 million logs and comparing it with baseline methods based on deep learning and statistical analysis. Experimental results show that LogRCA outperforms the baselines in terms of consistency and accuracy in detecting candidate root causes.

In summary, LogRCA aims to simplify fault diagnosis by reducing the number of logs that users need to review, thereby helping developers and operations teams understand and resolve system failures more quickly.
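The balancing idea in the second contribution can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say which clustering algorithm or weighting scheme LogRCA uses, so the greedy Jaccard clustering, the inverse-frequency weights, the function names (`tokenize`, `cluster_lines`, `balancing_weights`), the similarity threshold, and the sample log lines are all assumptions made for illustration. The sketch groups similar log lines, treats each group as one estimated root cause, and gives lines from rare groups a larger training weight.

```python
from collections import Counter


def tokenize(line):
    # Assumption: drop tokens containing digits so that IDs and durations
    # do not split otherwise identical messages into separate clusters.
    return frozenset(
        tok for tok in line.lower().split() if not any(ch.isdigit() for ch in tok)
    )


def jaccard(a, b):
    # Jaccard similarity of two token sets; two empty sets count as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_lines(lines, threshold=0.5):
    # Greedy single-pass clustering (a stand-in for the paper's automatic
    # clustering): each line joins the first cluster whose representative
    # is similar enough, otherwise it starts a new cluster.
    representatives = []
    labels = []
    for line in lines:
        tokens = tokenize(line)
        for idx, rep in enumerate(representatives):
            if jaccard(tokens, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(tokens)
            labels.append(len(representatives) - 1)
    return labels


def balancing_weights(labels):
    # Inverse-frequency weights: every cluster contributes the same total
    # weight, so rare root causes are not drowned out by frequent ones.
    counts = Counter(labels)
    total = len(labels)
    n_clusters = len(counts)
    return [total / (n_clusters * counts[label]) for label in labels]


# Hypothetical log lines: three occurrences of one failure, one of another.
lines = [
    "connection to db-1 timed out after 30 s",
    "connection to db-2 timed out after 45 s",
    "connection to db-7 timed out after 30 s",
    "disk quota exceeded on volume vol-3",
]
labels = cluster_lines(lines)
weights = balancing_weights(labels)
for line, label, weight in zip(lines, labels, weights):
    print(f"cluster={label} weight={weight:.2f} {line}")
```

In this toy input the three timeout lines fall into one cluster and the disk-quota line into another, so the single rare line receives three times the weight of each frequent line. A real pipeline would more plausibly cluster learned embeddings rather than raw token sets, and could resample the data instead of weighting it.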