Practical Root Cause Localization for Microservice Systems Via Trace Analysis
Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, Dan Pei
DOI: https://doi.org/10.1109/iwqos52092.2021.9521340
2021-01-01
Abstract: Microservice architecture is adopted by an increasing number of systems because of its benefits in delivery, scalability, and autonomy. When a fault occurs, it is essential but challenging to localize the root-cause microservices promptly. Traces are helpful for root-cause microservice localization, and thus many recent approaches utilize them. However, these approaches are less practical because they rely on supervision or other unrealistic assumptions. To overcome these limitations, we propose a more practical root-cause microservice localization approach named TraceRCA. The key insight of TraceRCA is that a microservice with more abnormal and fewer normal traces passing through it is more likely to be the root cause. Based on this insight, TraceRCA is composed of three steps: trace anomaly detection, suspicious microservice set mining, and microservice ranking. We conducted experiments on hundreds of injected faults in a widely used open-source microservice benchmark and a production system. The results show that TraceRCA is effective in various situations. The top-1 accuracy of TraceRCA outperforms the state-of-the-art unsupervised approaches by 44.8%. Besides, TraceRCA has been applied in a large commercial bank, where it helps operators localize root causes for real-world faults accurately and efficiently. We also share some lessons learned from our real-world deployment.
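The key insight stated in the abstract can be illustrated with a minimal sketch. It is not the TraceRCA implementation: the abstract does not give the scoring formula, so the function name `rank_microservices`, the input layout, and the harmonic-mean score below are all assumptions chosen for illustration. The sketch also assumes each trace already carries an abnormal/normal label, which stands in for the trace anomaly detection step.

```python
from collections import Counter


def rank_microservices(traces):
    """Rank services by how strongly abnormal traces concentrate on them.

    `traces` is a list of (services, is_abnormal) pairs, where `services`
    is an iterable of service names the trace passed through. This input
    layout is a hypothetical one, not the paper's data format.
    """
    abnormal = Counter()  # abnormal traces passing through each service
    normal = Counter()    # normal traces passing through each service
    total_abnormal = 0
    for services, is_abnormal in traces:
        if is_abnormal:
            total_abnormal += 1
        for svc in set(services):  # count each service once per trace
            if is_abnormal:
                abnormal[svc] += 1
            else:
                normal[svc] += 1

    scores = {}
    for svc in set(abnormal) | set(normal):
        a, n = abnormal[svc], normal[svc]
        # Share of all abnormal traces that pass through this service.
        support = a / total_abnormal if total_abnormal else 0.0
        # Share of this service's traces that are abnormal.
        confidence = a / (a + n)
        # Assumed score: harmonic mean of the two shares.
        denom = support + confidence
        scores[svc] = 2 * support * confidence / denom if denom else 0.0

    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


example_traces = [
    (["gateway", "order", "payment"], True),
    (["gateway", "order", "payment"], True),
    (["gateway", "order"], False),
    (["gateway", "inventory"], False),
    (["gateway", "payment"], True),
]

for service, score in rank_microservices(example_traces):
    print(f"{service}: {score:.3f}")
```

In this toy input every abnormal trace passes through `payment` and no normal trace does, so it ranks first. `gateway` also sees every abnormal trace, but it sees normal ones too, which lowers its score.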