Robust Failure Diagnosis of Microservice System through Multimodal Data

Shenglin Zhang,Pengxiang Jin,Zihan Lin,Yongqian Sun,Bicheng Zhang,Sibo Xia,Zhengdan Li,Zhenyu Zhong,Minghua Ma,Wa Jin,Dai Zhang,Zhenyu Zhu,Dan Pei
DOI: https://doi.org/10.48550/arXiv.2302.10512
2023-05-31
Abstract: Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more accurate diagnosis. However, effectively representing these data and addressing imbalanced failures remain challenging. To tackle these issues, we propose DiagFusion, a robust failure diagnosis approach that uses multimodal data. It leverages embedding techniques and data augmentation to represent the multimodal data of service instances, combines deployment data and traces to build a dependency graph, and uses a graph neural network to localize the root cause instance and determine the failure type. Our evaluations using real-world datasets show that DiagFusion outperforms existing methods in terms of root cause instance localization (improving by 20.9% to 368%) and failure type determination (improving by 11.0% to 169%).
Software Engineering
What problem does this paper attempt to address?
This paper addresses automatic failure diagnosis in microservice systems. Specifically, it aims to improve diagnostic accuracy by combining multimodal data (traces, logs, and metrics). Most existing failure diagnosis methods rely on a single modality, which limits how well they capture failure patterns: a single data source may not reflect all the characteristics of a failure, and some failure types may not manifest in a given modality at all, so methods that depend on that modality cannot identify them.

To overcome these problems, the paper proposes a robust failure diagnosis method named DiagFusion, which works in three steps:

1. **Multimodal data representation**: DiagFusion uses embedding and data augmentation techniques to represent the multimodal data of service instances in a unified form. Specifically, it converts data of different modalities (traces, logs, and metrics) into structured events and then into vectors.
2. **Dependency graph construction**: DiagFusion combines deployment data and trace data to construct a dependency graph (DG), which captures the invocation relationships between service instances and the possible failure propagation paths.
3. **Graph neural network application**: DiagFusion uses a graph neural network (GNN) to localize the root cause instance and determine the failure type. Through message passing, the GNN learns how failures propagate in the system, so diagnosis is performed from a global perspective.

In evaluations on real-world datasets, DiagFusion significantly outperforms existing methods in root cause instance localization (improving by 20.9% to 368%) and failure type determination (improving by 11.0% to 169%).
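The dependency graph construction and message-passing steps described above can be illustrated with a small sketch. This is not DiagFusion's implementation: the instance names, host names, feature vectors, and both function names are hypothetical, the rule that co-located instances are linked in both directions is an assumption about how deployment data might be used, and the unweighted mean aggregation stands in for a trained GNN layer, which would apply learned weights and a nonlinearity.

```python
from collections import defaultdict


def build_dependency_graph(trace_calls, deployment):
    """Return {instance: set of instances it points to}.

    trace_calls: iterable of (caller, callee) pairs taken from traces.
    deployment:  dict mapping instance -> host (hypothetical format).
    """
    graph = defaultdict(set)
    for instance in deployment:
        graph[instance]  # register every deployed instance as a node

    # Trace edges: caller -> callee.
    for caller, callee in trace_calls:
        graph[caller].add(callee)
        graph[callee]  # make sure the callee is a node too

    # Deployment edges (assumption): instances on the same host are
    # linked in both directions, since they share resources.
    by_host = defaultdict(list)
    for instance, host in deployment.items():
        by_host[host].append(instance)
    for instances in by_host.values():
        for a in instances:
            for b in instances:
                if a != b:
                    graph[a].add(b)
    return dict(graph)


def message_passing_step(graph, features):
    """One round of mean aggregation.

    Each node's new vector is the average of its own vector and the
    vectors of the nodes it points to, so information about an anomalous
    dependency flows back to the instances that rely on it.
    """
    updated = {}
    for node, own in features.items():
        vectors = [own] + [features[n] for n in sorted(graph.get(node, ()))]
        updated[node] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return updated


# Hypothetical toy system: frontend calls cart, cart calls db.
trace_calls = [("frontend-1", "cart-1"), ("cart-1", "db-1")]
deployment = {"frontend-1": "host-a", "cart-1": "host-a", "db-1": "host-b"}
graph = build_dependency_graph(trace_calls, deployment)

# Toy 2-dimensional instance vectors; only db-1 shows an anomaly signal.
features = {
    "frontend-1": [0.0, 0.0],
    "cart-1": [0.0, 0.0],
    "db-1": [1.0, 0.0],
}
updated = message_passing_step(graph, features)
```

After one step, `cart-1` picks up part of the anomaly signal from `db-1`, while `frontend-1` would need a second step to see it. Stacking several such rounds is how a GNN lets each instance's representation reflect failures further along the dependency graph.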
In conclusion, this paper proposes a new failure diagnosis method that combines multimodal data with a graph neural network to improve the accuracy and efficiency of failure diagnosis in microservice systems.