SparseRCA: Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces

Zhenhe Yao, Haowei Ye, Changhua Pei, Guang Cheng, Guangpei Wang, Zhiwei Liu, Hongwei Chen, Hang Cui, Zeyan Li, Jianhui Li, Gaogang Xie, Dan Pei
DOI: https://doi.org/10.1109/issre62328.2024.00045
2024-01-01
Abstract: Microservice architecture has become a predominant paradigm in the software industry. It requires robust end-to-end testing to ensure that all components integrate seamlessly before deployment, and rapidly pinpointing the issue when a test case fails is crucial for software development efficiency. In testing environments, however, the available traces are often sparse and the system is continuously upgraded, which renders existing root cause analysis (RCA) methods for microservices ineffective. To address these challenges, we propose SparseRCA. By assessing the abnormality of the exclusive latency, SparseRCA directly determines the root cause probability, which sidesteps the problem that fault propagation information, such as call relationships, cannot be fully obtained in sparse trace scenarios. In addition, by reconstructing the exclusive latency from decoupled atomic span units, it handles latency prediction for the new traces that frequent upgrades introduce. We evaluate SparseRCA on real-world datasets from the testing environment of a large e-commerce system, where it demonstrates significant improvements over existing models. Our findings underscore the effectiveness of SparseRCA in addressing the challenges of RCA in microservice testing environments.
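The abstract does not define exclusive latency. The sketch below assumes the common tracing reading: a span's own duration minus the time covered by its direct child spans. It then ranks services by how far that value deviates from a historical baseline. The `Span` fields, the `baseline` table, and the z-score scoring are all illustrative assumptions. In particular, the z-score is a simple stand-in and is not SparseRCA's reconstruction from atomic span units.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record; field names are assumptions.
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float


def exclusive_latencies(spans):
    """Return {span_id: exclusive latency in ms}.

    Exclusive latency here is the span duration minus the time covered
    by its direct children. Overlapping children are merged so that
    parallel calls are not counted twice.
    """
    children = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)

    result = {}
    for s in spans:
        # Clip each child interval to the parent, then sort by start time.
        intervals = sorted(
            (max(c.start_ms, s.start_ms), min(c.end_ms, s.end_ms))
            for c in children.get(s.span_id, [])
        )
        covered = 0.0
        cur_start, cur_end = None, None
        for a, b in intervals:
            if b <= a:
                continue  # child lies outside the parent or is empty
            if cur_end is None or a > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = a, b
            else:
                cur_end = max(cur_end, b)
        if cur_end is not None:
            covered += cur_end - cur_start
        result[s.span_id] = (s.end_ms - s.start_ms) - covered
    return result


def rank_by_abnormality(spans, baseline):
    """Rank services by z-score of exclusive latency, highest first.

    baseline maps service -> (mean_ms, std_ms) and is a hypothetical
    table of historical statistics, not something the paper specifies.
    """
    excl = exclusive_latencies(spans)
    scores = []
    for s in spans:
        mean, std = baseline[s.service]
        scores.append((s.service, (excl[s.span_id] - mean) / max(std, 1e-9)))
    return sorted(scores, key=lambda x: x[1], reverse=True)


# Toy trace: gateway calls orders and payments; payments calls db.
trace = [
    Span("a", None, "gateway", 0, 100),
    Span("b", "a", "orders", 10, 40),
    Span("c", "a", "payments", 30, 90),
    Span("d", "c", "db", 35, 85),
]
baseline = {
    "gateway": (20, 5),
    "orders": (30, 5),
    "payments": (10, 5),
    "db": (10, 5),
}
ranking = rank_by_abnormality(trace, baseline)
```

In this toy trace the `payments` span is slow end to end, but almost all of its time is spent waiting on `db`, so only `db` shows an abnormal exclusive latency. Each span is scored from its own latency and its direct children alone, which is why a score of this kind can still be computed when the rest of the call graph is missing.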