An Empirical Study on Change-induced Incidents of Online Service Systems.

Yifan Wu, Bingxu Chai, Ying Li, Bingchang Liu, Jianguo Li, Yong Yang, Wei Jiang
DOI: https://doi.org/10.1109/icse-seip58684.2023.00027
2023-01-01
Abstract: Although dedicated efforts have been devoted to ensuring the service quality of online service systems, these systems still suffer from incidents due to various causes, which lead to user dissatisfaction and economic loss. Change is the most disruptive yet unavoidable maintenance event in online service systems, and it is one of the leading causes of incidents. To enforce changes with minimal negative impact, change management has been widely applied in industry. However, change-induced incidents still occur. Most empirical studies involving change-induced incidents are limited to one specific type of incident-inducing change. Moreover, the characteristics of change-induced incidents and the challenges of change management have not been studied. To fill this knowledge gap, this paper presents the first empirical study on change-induced incidents of online service systems. We collect 161 real change-induced incidents from a large-scale online service system at Ant Group over two years. By manually examining their post-mortem reports, we clarify the severity of change-induced incidents and analyze their characteristics in terms of change types, root causes, and mitigation strategies. Furthermore, we identify a series of vital challenges of change management in practice and point out several practical implications for researchers and engineers. We believe our work could help understand change-induced incidents and offer inspiration and guidance for engineers and researchers seeking to improve change management.