Diagnosing Performance Issues for Large-Scale Microservice Systems with Heterogeneous Graph
Lei Tao, Xianglin Lu, Shenglin Zhang, Jiaqi Luan, Yingke Li, Mingjie Li, Zeyan Li, Qingyang Yu, Hucheng Xie, Ruijie Xu, Chenyuan Hu, Canqun Yang, Dan Pei
DOI: https://doi.org/10.1109/tsc.2024.3402172
2024-01-01
Abstract: The availability of microservice systems is critical to business operations and corporate reputation. However, the dynamics and complexity of these systems make diagnosing performance issues at large scale significantly challenging. After investigating hundreds of real-world performance issue cases at Tencent, we find that previous troubleshooting approaches fail to accurately localize root causes because they overlook the inconsistency between causality and calling relationships. Therefore, we propose a novel approach, MicroDig, to diagnose performance issues for large-scale microservice systems. Specifically, MicroDig constructs a heterogeneous propagation graph to capture the causal relationships between calls and microservices. It then conducts a heterogeneity-oriented random walk (HORW) to pinpoint the culprit microservice. We evaluate MicroDig extensively on 60 real-world performance issues collected from Tencent, 80 manually injected ones collected from a widely used open-source microservice system, and 128 performance issues collected from an e-commerce system used by a top-tier global commercial bank. MicroDig achieves 94.1%, 85.5%, and 93.8% top-3 accuracy on the three datasets, respectively, significantly outperforming six popular baseline methods. Additionally, we share success stories and lessons learned from the deployment of MicroDig at Tencent.
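The top-3 accuracy reported in the abstract is conventionally the fraction of cases in which the true root cause appears among the three highest-ranked candidates. The following is a minimal Python sketch of that metric, assuming each diagnosis produces a ranked list of microservice names. The function name `top_k_accuracy` and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose true root cause is among the first k candidates.

    rankings[i] is the ranked candidate list for case i (best first);
    truths[i] is the labeled root-cause microservice for case i.
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical example: two diagnosed cases with made-up service names.
example_rankings = [
    ["cart", "payment", "auth", "search"],   # truth "auth" is ranked 3rd: hit
    ["search", "cart", "payment", "auth"],   # truth "auth" is ranked 4th: miss
]
example_truths = ["auth", "auth"]
example_score = top_k_accuracy(example_rankings, example_truths, k=3)
```

Under this definition, a score of 94.1% would mean that the labeled culprit microservice was among the top three candidates in roughly 94 of every 100 cases.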