Understanding and Improving Change Risk Detection in Practice

Yifan Wu, Yunpeng Wang, Jianguo Li, Ying Li, Bingxu Chai, Wei Jiang
DOI: https://doi.org/10.1109/saner60148.2024.00079
2024-01-01
Abstract: Changes are inevitable and frequent in large-scale online service systems, and they have been one of the leading causes of incidents. Change risk detection (CRD) aims to help engineers detect high-risk changes so that proactive actions can be taken to avoid incidents, which is vital for the availability and reliability of online service systems. Though some efforts have been dedicated to CRD, their performance is still far from satisfactory in practice. To better understand the practical challenges of CRD, we conducted the first empirical study on a large-scale online service system in Ant Group. Through this study, we identified four critical challenges: poor interpretability, adaptation to diverse change types, indirect anomaly factors, and expected but false-alarm anomalies. To address these challenges, we propose an effective and eXplainable Change Risk Detection framework named XCRD. XCRD can detect change-induced unexpected anomalies using multi-source data and provide explainable alerts that help engineers diagnose and mitigate anomalies. We have successfully deployed XCRD in Ant Group for the past 14 months, demonstrating a significant performance improvement in CRD. We also discuss some successful cases and lessons learned during our study. To our knowledge, we are the first to deeply investigate CRD in industrial scenarios. We believe that our work can provide valuable insights for engineers and researchers to understand and improve CRD in practice.