Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder
Yongqian Sun, Zihan Lin, Binpeng Shi, Shenglin Zhang, Shiyu Ma, Pengxiang Jin, Zhenyu Zhong, Lemeng Pan, Yicheng Guo, Dan Pei
DOI: https://doi.org/10.1145/3695999
IF: 3.685
2024-09-13
ACM Transactions on Software Engineering and Methodology
Abstract: Accurate and efficient localization of root cause instances in large-scale microservice systems is of paramount importance. Unfortunately, prevailing methods face several limitations. Some recent methods rely on supervised learning, which requires a substantial amount of labeled data. However, labeling root cause instances is time-consuming and laborious, especially when the data spans multiple modalities such as logs, traces, and metrics. Moreover, some approaches use deep learning for localization but lack interpretability and a mechanism for continuous improvement. To address these challenges, we propose DeepHunt, a novel root cause localization method based on multimodal data analysis. First, DeepHunt introduces a Root Cause Score (RCS) that integrates reconstruction errors with failure propagation patterns (upstream-downstream relationships), making the localization of root causes interpretable. Second, it adopts a Graph Autoencoder (GAE) to overcome the scarcity of labeled data, and it employs data augmentation to mitigate the adverse effects of insufficient historical training samples. We evaluate DeepHunt on two open-source datasets, where it outperforms existing methods in a zero-label cold start. DeepHunt can be further improved by continuous fine-tuning through a feedback mechanism.
computer science, software engineering
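
The abstract describes the Root Cause Score only at a high level, as a combination of per-instance reconstruction errors and failure propagation along upstream-downstream relationships. The following is a minimal illustrative sketch of that general idea, not the formula used by DeepHunt. The function name `root_cause_scores`, the weight `alpha`, the blending rule, and the assumption that a failure in a callee tends to raise the anomaly of its direct callers are all hypothetical choices made for illustration; the reconstruction errors would in practice come from a trained autoencoder and are hard-coded here.

```python
from collections import defaultdict


def root_cause_scores(recon_error, calls, alpha=0.5):
    """Toy score (an assumption, not the paper's RCS): blend an instance's own
    reconstruction error with the mean error of its direct callers."""
    # Map each callee to the instances that call it.
    callers = defaultdict(list)
    for caller, callee in calls:
        callers[callee].append(caller)

    scores = {}
    for inst, err in recon_error.items():
        ups = callers.get(inst, [])
        # Mean anomaly of direct callers; 0.0 when nothing calls this instance.
        caller_mean = sum(recon_error[u] for u in ups) / len(ups) if ups else 0.0
        scores[inst] = (1 - alpha) * err + alpha * caller_mean

    # Highest score first: the top entry is the most likely root cause.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical reconstruction errors and call edges (caller, callee).
recon_error = {"frontend": 0.6, "cart": 0.7, "db": 0.9}
calls = [("frontend", "cart"), ("cart", "db")]
ranking = root_cause_scores(recon_error, calls)
```

In this toy chain, `db` ranks first because it has the largest reconstruction error and its caller is also anomalous. Because each score is a sum of two named terms, an operator can see which instances contributed to a ranking, which is the kind of interpretability the abstract attributes to combining the two signals.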