No More Data Silos: Unified Microservice Failure Diagnosis with Temporal Knowledge Graph
Shenglin Zhang, Yongxin Zhao, Sibo Xia, Shirui Wei, Yongqian Sun, Chenyu Zhao, Shiyu Ma, Junhua Kuang, Bolin Zhu, Lemeng Pan, Yicheng Guo, Dan Pei
DOI: https://doi.org/10.1109/tsc.2024.3489444
IF: 11.019
2024-01-01
IEEE Transactions on Services Computing
Abstract: Microservices improve the scalability and flexibility of monolithic architectures to accommodate the evolution of software systems, but the complexity and dynamics of microservices challenge system reliability. Ensuring microservice quality requires efficient failure diagnosis, including detection and triage. Failure detection involves identifying anomalous behavior within the system, while triage entails classifying the failure type and directing it to the appropriate engineering team for resolution. Unfortunately, current approaches that rely on single-modal monitoring data, such as metrics, logs, or traces, cannot capture all failures and neglect the interconnections among multimodal data, leading to erroneous diagnoses. Recent multimodal data fusion studies struggle to achieve deep integration, and their diagnostic accuracy is limited because interdependencies are insufficiently captured. Therefore, we propose UniDiag, which leverages temporal knowledge graphs to fuse multimodal data for effective failure diagnosis. UniDiag applies a simple yet effective stream-based anomaly detection method to reduce computational cost and a novel microservice-oriented graph embedding method to represent the state of systems comprehensively. To assess the performance of UniDiag, we conduct extensive evaluation experiments using datasets from two benchmark microservice systems, demonstrating its superiority over existing methods and affirming the efficacy of multimodal data fusion. Additionally, we have made the code and data publicly available to facilitate further research.
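To make the idea of fusing multimodal monitoring data in a temporal knowledge graph more concrete, the following is a minimal sketch, not UniDiag's implementation. It assumes a temporal knowledge graph can be represented as timestamped (subject, relation, object) facts, and that events derived from traces, logs, and metrics are written into the same graph. The class names, relation names, service names, and timestamps are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """One temporal fact: (subject, relation, object, timestamp)."""
    subject: str
    relation: str
    obj: str
    timestamp: int


class TemporalKG:
    """Toy temporal knowledge graph storing facts indexed by timestamp."""

    def __init__(self):
        self._by_time = defaultdict(list)

    def add(self, subject, relation, obj, timestamp):
        self._by_time[timestamp].append(Quad(subject, relation, obj, timestamp))

    def snapshot(self, start, end):
        """Return all facts with start <= timestamp < end, ordered by time."""
        return [
            q
            for t in sorted(self._by_time)
            if start <= t < end
            for q in self._by_time[t]
        ]

    def neighbors(self, entity, start, end):
        """Entities linked to `entity` by any fact inside the time window."""
        out = set()
        for q in self.snapshot(start, end):
            if q.subject == entity:
                out.add(q.obj)
            elif q.obj == entity:
                out.add(q.subject)
        return out


# Hypothetical events from three modalities, written into one graph.
kg = TemporalKG()
kg.add("cart", "calls", "payment", 100)              # derived from a trace span
kg.add("payment", "emits_log", "TimeoutError", 101)  # derived from a log template
kg.add("payment", "has_anomaly", "cpu_usage", 102)   # derived from a metric alert
kg.add("cart", "calls", "catalog", 500)              # later, unrelated call

window = kg.neighbors("payment", 100, 200)
```

In this sketch, querying the neighborhood of one service within a time window returns evidence from all three modalities at once, which is the kind of cross-modal connection that single-modal approaches would miss. How UniDiag actually constructs and embeds its graph is described in the paper itself.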