Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems
Shenglin Zhang, Zhongjie Pan, Heng Liu, Pengxiang Jin, Yongqian Sun, Qianyu Ouyang, Jiaju Wang, Xueying Jia, Yuzhi Zhang, Hui Yang, Yongqiang Zou, Dan Pei
DOI: https://doi.org/10.1109/issre59848.2023.00012
2023-01-01
Abstract: Microservice invocation anomalies can degrade user experience and reduce service revenue. Existing trace anomaly detection approaches typically focus on anomalies in response time and invocation structure, but they often neglect the fine-grained features that could help detect anomalies. In addition, trace data collected in real-world scenarios is typically noisy, which can reduce the effectiveness of anomaly detection approaches. The sheer volume of trace data in large-scale systems can also significantly slow model training. To address these challenges, we propose TraceSieve, an unsupervised method that accurately detects trace anomalies. Our approach uses an auto-encoder architecture within an adversarial training framework to filter out noisy data. To reduce the time consumed during training, we integrate VGAE-EWC, which combines a Variational Graph Auto-Encoder (VGAE) with Elastic Weight Consolidation (EWC). Finally, we localize the root cause of detected trace anomalies. We evaluate TraceSieve on two datasets, on which it achieves F1-scores of 0.970 and 0.925, respectively, outperforming state-of-the-art trace anomaly detection approaches.
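The abstract does not say how VGAE-EWC applies Elastic Weight Consolidation, so the following is only a minimal sketch of the standard EWC idea, under stated assumptions. When a model is updated on new data, EWC adds a quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2 to the new-task loss. Here theta* holds the previously learned parameters and F is a diagonal Fisher information estimate, which marks the parameters that mattered most before. This lets the model be fine-tuned incrementally instead of retrained from scratch. All function names, the toy quadratic task loss, and the values of lam and the learning rate are hypothetical illustrations, not TraceSieve's code.

```python
from typing import List, Sequence


def ewc_penalty(theta: Sequence[float], theta_star: Sequence[float],
                fisher: Sequence[float], lam: float) -> float:
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    if not (len(theta) == len(theta_star) == len(fisher)):
        raise ValueError("theta, theta_star, and fisher must have equal length")
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for t, ts, f in zip(theta, theta_star, fisher))


def diagonal_fisher(per_sample_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-sample log-likelihood gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def train_with_ewc(theta0: Sequence[float], target: Sequence[float],
                   theta_star: Sequence[float], fisher: Sequence[float],
                   lam: float, lr: float = 0.1, steps: int = 200) -> List[float]:
    """Gradient descent on a hypothetical toy new-task loss sum_i (theta_i - target_i)^2
    plus the EWC penalty; the penalty's gradient is lam * F_i * (theta_i - theta*_i)."""
    theta = list(theta0)
    for _ in range(steps):
        theta = [
            t - lr * (2.0 * (t - b) + lam * f * (t - ts))
            for t, b, ts, f in zip(theta, target, theta_star, fisher)
        ]
    return theta


if __name__ == "__main__":
    fisher = diagonal_fisher([[1.0, 2.0], [3.0, 0.0]])
    print(fisher)
    print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 1.0], lam=0.5))
    print(train_with_ewc([0.0, 0.0], [2.0, 2.0], [0.0, 2.0], [4.0, 1.0], lam=0.5))
```

With these toy numbers, the first parameter has high Fisher importance (F = 4) and settles at about 1.0. That is halfway between its previous value 0.0 and the new-task optimum 2.0, which shows how the penalty trades retaining old knowledge against fitting new data.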