Understanding and predicting incident mitigation time

Weijing Wang, Junjie Chen, Lin Yang, Hongyu Zhang
DOI: https://doi.org/10.1016/j.infsof.2022.107119
IF: 3.9
2022-11-26
Information and Software Technology
Abstract:
Context: Incident management plays a significant role in online service systems. Incidents should be mitigated as soon as possible to achieve high service stability. However, available resources tend to be limited, so engineers have to schedule their tasks carefully. Time to Mitigate (TTM) is the time required to restore service availability for an incident. Predicting TTM can help engineers better estimate maintenance effort and give developers more information when arranging their tasks.
Objective: Our work aims to predict TTM accurately and proceeds in two main steps. First, we perform an empirical study to gain a deeper understanding of incidents. Then, we design an effective approach for TTM prediction based on the findings of the empirical study.
Methods: In the empirical study, we use 20 Microsoft online service systems to investigate the duration of each stage in incident management and the relationship between TTM and incident indicators. We then propose TTMPred, a deep-learning-based approach for TTM prediction in the continuous triage scenario, built on the features identified in our empirical study. We also improve the generality of TTMPred by extending it to predict the fixing time of traditional software bugs.
Results: We investigate the effectiveness of TTMPred on four large-scale online service systems at Microsoft, as well as on four widely used Bugzilla-based projects. The results show that TTMPred performs better than the compared approaches for both incident TTM prediction and bug-fixing time prediction. For example, on average, TTMPred improves on the state-of-the-art regression-based approach by 25.66% in terms of MAE (Mean Absolute Error) on the incident data and by 42.14% in terms of MAE on the bug data.
Conclusion: TTMPred can be extended to the bug scenario and can continuously predict bug-fixing time accurately during the triage process.
computer science, information systems, software engineering