Root Cause Analysis for Cloud-Native Applications

Bartosz Żurkowski,Krzysztof Zieliński
DOI: https://doi.org/10.1109/tcc.2024.3358823
IF: 5.697
2024-03-08
IEEE Transactions on Cloud Computing
Abstract:Root cause analysis (RCA) is a critical component in maintaining the reliability and performance of modern cloud applications. However, due to the inherent complexity of cloud environments, traditional RCA techniques become insufficient in supporting system administrators in daily incident response routines. This article presents an RCA solution specifically designed for cloud applications, capable of pinpointing failure root causes and recreating complete fault trajectories from the root cause to the effect. The novelty of our approach lies in approximating causal symptom dependencies by synergizing several symptom correlation methods that assess symptoms in terms of structural, semantic, and temporal aspects. The solution integrates statistical methods with system structure and behavior mining, offering a more comprehensive analysis than existing techniques. Based on these concepts, in this work, we provide definitions and construction algorithms for RCA model structures used in the inference, propose a symptom correlation framework encompassing essential elements of symptom data analysis, and provide a detailed description of the elaborated root cause identification process. Functional evaluation on a live microservice-based system demonstrates the effectiveness of our approach in identifying root causes of complex failures across multiple cloud layers.
computer science, information systems, theory & methods
What problem does this paper attempt to address?