Identifying Root-Cause Changes for User-Reported Incidents in Online Service Systems

Yujin Zhao, Ling Jiang, Ye Tao, Songlin Zhang, Changlong Wu, Tong Jia, Xiaosong Huang, Ying Li, Zhonghai Wu
DOI: https://doi.org/10.1109/issre59848.2023.00028
2023-01-01
Abstract: In online service systems, a majority of incidents are caused by changes, and these incidents can degrade user experience and cause significant economic loss. Experience with a real-world, large-scale online service system shows that more than half of change-induced incidents are reported by users. Identifying root-cause changes for these incidents is challenging because of the inherent gap between user-perceived, functional-level incident information and component-level change details. Inadequate causal knowledge poses a further challenge. In this paper, we propose Raccoon, a novel approach based on causal knowledge mining that identifies root-cause changes for user-reported incidents. To bridge the gap between incidents and changes, Raccoon uses the fault tree and the software product line to represent incidents and changes at the user-perceived functional level. These two structures also serve as the backbone of its causal knowledge. To overcome the lack of causal knowledge, Raccoon adopts efficient knowledge extraction and inference methods. Moreover, Raccoon provides recommendations at both software product line and change granularity to meet the diverse demands of incident triage and root-cause change identification in incident management. We evaluate Raccoon on a real-world dataset collected from a large-scale online service system. The results show that Raccoon significantly outperforms state-of-the-art baseline approaches, demonstrating its effectiveness.