Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?

Luan Pham, Huong Ha, Hongyu Zhang
2024-08-25
Abstract: Microservice architecture has become a popular architecture adopted by many cloud applications. However, identifying the root cause of a failure in microservice systems is still a challenging and time-consuming task. In recent years, researchers have introduced various causal inference-based root cause analysis methods to assist engineers in identifying the root causes. To gain a better understanding of the current status of causal inference-based root cause analysis techniques for microservice systems, we conduct a comprehensive evaluation of nine causal discovery methods and twenty-one root cause analysis methods. Our evaluation aims to understand both the effectiveness and efficiency of causal inference-based root cause analysis methods, as well as other factors that affect their performance. Our experimental results and analyses indicate that no method stands out in all situations; each method tends to fall short in effectiveness or efficiency, or to show sensitivity to specific parameters. Notably, the performance of root cause analysis methods on synthetic datasets may not accurately reflect their performance in real systems. Indeed, there is still considerable room for further improvement. We also suggest possible future work based on our findings.
Software Engineering
What problem does this paper attempt to address?
The paper attempts to address the challenge of identifying the root cause of failures (Root Cause Analysis, RCA) in microservice systems. Specifically, while microservice architecture offers significant advantages in scalability, resilience, and flexibility, determining the root cause of failures becomes very difficult and time-consuming due to the complexity of microservice systems and the high coupling between services. Existing RCA methods based on causal inference have limitations in effectiveness and efficiency when dealing with large-scale microservice systems and are sensitive to specific parameters. Therefore, this paper aims to understand the performance of these methods in different scenarios through a comprehensive evaluation of existing causal discovery and RCA methods and to explore the possibilities for further improvement.

The main contributions of the paper include:

1. **Comprehensive Evaluation**: A comprehensive evaluation of nine causal discovery methods and twenty-one RCA methods, covering synthetic datasets and real datasets from three benchmark microservice systems.
2. **Performance Analysis**: Evaluation of these methods in terms of effectiveness and efficiency, and analysis of factors affecting their performance, such as input data length and hyperparameter tuning.
3. **Future Research Directions**: Based on the evaluation results, future research directions are proposed, especially the challenges of applying causal inference RCA methods in large-scale microservice systems.

Through this work, the paper hopes to provide more effective tools and methods for fault diagnosis in microservice systems.