TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Xiaomin Wu, Meng Zhang, Qingjun Chen, Xin Gao, Xuedong Gao, Hao Fan, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
2023-10-28
Abstract: Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale: they usually comprise hundreds of components, which leads to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.
Software Engineering
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses Root Cause Analysis (RCA) in large-scale microservice systems. Specifically:

1. **Challenges of Complexity and Scale**: Modern microservice systems typically consist of hundreds of components, which makes root cause analysis complex. Without automated tools, traditional RCA can take hours to complete and is labor-intensive.
2. **Elimination of Redundant Components**: Large-scale microservice systems contain many redundant components that do not contribute to a failure but increase the complexity of the RCA process. Identifying and eliminating these components effectively is therefore a key issue.
3. **Adaptive and Explainable Strategies**: Current RCA methods mostly rely on engineers' experience and lack unified standards and rules. An adaptive service-graph pruning strategy is therefore needed to improve the efficiency and accuracy of RCA while keeping the pruning process explainable.

### Main Contributions of the Paper

To address these challenges, the paper proposes a new framework called **TraceDiag**. The framework uses Reinforcement Learning (RL) to obtain an automated, explainable, and adaptive pruning strategy that removes redundant components and improves the efficiency and accuracy of RCA. The specific contributions are as follows:

1. **End-to-End RCA Framework**: TraceDiag combines graph pruning with causal analysis, which achieves higher accuracy and robustness than correlation-based analysis.
2. **Reinforcement Learning-Driven Pruning**: The policy selects actions from a predefined pool of pruning actions, which keeps the pruning process explainable and lets it adapt to new RCA instances.
3. **Practical Application Effectiveness**: Evaluated on real data from the Microsoft Exchange system, TraceDiag outperforms existing RCA methods. It has been integrated into the Microsoft M365 Exchange system, significantly improving system reliability and greatly reducing the need for manual intervention.
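
To make the idea of choosing from a predefined pool of pruning actions more concrete, here is a minimal Python sketch. It is not the paper's implementation: the component attributes (`anomaly_score`, `latency_corr`), the two actions, their thresholds, and the toy graph are all hypothetical, and the policy is a fixed hand-written list standing in for what TraceDiag would learn with reinforcement learning. It only illustrates why a policy built from named actions is easy to interpret: each removal can be traced back to the action that caused it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Component:
    """A node in the service dependency graph (attributes are hypothetical)."""
    name: str
    anomaly_score: float  # assumed: how abnormal the component's metrics look
    latency_corr: float   # assumed: correlation with end-to-end latency


# Adjacency map: caller -> set of callees.
Graph = Dict[str, Set[str]]

# Hypothetical action pool. Each action is a named predicate that returns
# True for components it would prune. Thresholds are illustrative only.
ACTION_POOL: Dict[str, Callable[[Component], bool]] = {
    "drop_low_anomaly": lambda c: c.anomaly_score < 0.2,
    "drop_low_latency_corr": lambda c: c.latency_corr < 0.3,
}


def prune(
    graph: Graph,
    components: Dict[str, Component],
    policy: List[str],
) -> Tuple[Graph, List[Tuple[str, List[str]]]]:
    """Apply the named actions in order and record what each one removed."""
    kept: Set[str] = set(components)
    log: List[Tuple[str, List[str]]] = []
    for action in policy:
        should_drop = ACTION_POOL[action]
        removed = sorted(n for n in kept if should_drop(components[n]))
        kept -= set(removed)
        log.append((action, removed))
    # Keep only edges whose endpoints both survived pruning.
    pruned = {n: {m for m in graph.get(n, set()) if m in kept} for n in kept}
    return pruned, log


# Toy example; service names and numbers are made up.
components = {
    c.name: c
    for c in [
        Component("frontend", 0.9, 0.8),
        Component("auth", 0.1, 0.5),
        Component("mailbox", 0.7, 0.9),
        Component("cache", 0.5, 0.1),
        Component("logging", 0.05, 0.05),
    ]
}
graph: Graph = {
    "frontend": {"auth", "mailbox", "logging"},
    "mailbox": {"cache"},
}

# Stand-in for a learned policy: a fixed sequence of action names.
policy = ["drop_low_anomaly", "drop_low_latency_corr"]
pruned, log = prune(graph, components, policy)

for action, removed in log:
    print(f"{action}: removed {removed}")
print("remaining nodes:", sorted(pruned))
```

In this sketch the log acts as the explanation of the pruning run, and the smaller graph is what a downstream causal analysis would operate on. A learned policy would choose the sequence of actions per incident instead of using a fixed list.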