Guardian of the Resiliency: Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient

Guanglei He, Xiaohui Nie, Ruming Tang, Kun Wang, Zhaoyang Yu, Xidao Wen, Kanglin Yin, Dan Pei
DOI: https://doi.org/10.1109/iwqos61813.2024.10682951
2024-01-01
Abstract: A microservice system's resilience is crucial for ensuring quality of service. Software changes are now frequent and error-prone, and erroneous software changes can reduce a microservice system's resilience to faults, leading to service failures and a degraded user experience. To better understand erroneous software changes, we conducted an empirical study of 256 real-world incidents from four well-known microservice systems. Our quantitative results indicate that 37.87% of erroneous software changes make the microservice systems less fault-resilient; that is, when a fault (e.g., network fluctuation or high CPU usage) occurs in the system after the software change, the services are more likely to experience failures. We refer to these software changes as Erroneous Software Changes that Reduce fault Resilience (ESCRs). Traditional methods struggle to detect ESCRs effectively because faults occur unpredictably and rarely fall within the post-change monitoring window. In this paper, we propose a novel framework named ResilienceGuardian, which aims to detect ESCRs before they make microservice systems less fault-resilient. The key idea is to use fault injection techniques to evaluate a system's fault resilience in the staging environment and then train lightweight classifiers on KPI segment pairs to detect ESCRs. The performance of ResilienceGuardian is systematically evaluated on three datasets with various faults and erroneous software changes. The results show that ResilienceGuardian significantly outperforms all the baselines, achieving a 0.9 F1-score in identifying ESCRs while reducing training time by 56.23% to 97.53%. In addition, ResilienceGuardian can achieve minute-level ESCR detection in large-scale microservice systems.
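To make the notion of a "KPI segment pair" concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a pair consists of one KPI segment recorded under an injected fault before the change and one recorded under the same fault after the change. The names `pair_features` and `looks_less_resilient`, the sample latency values, and the 0.5 relative threshold are all hypothetical, and the fixed threshold rule is a simple stand-in for the trained lightweight classifiers the abstract describes.

```python
from statistics import mean, pstdev

EPS = 1e-9  # guards against division by zero on flat segments


def pair_features(before, after):
    """Summarize how a KPI segment differs between two fault-injection runs.

    `before` and `after` are equal-purpose KPI samples (e.g. latency in ms)
    collected under the same injected fault, pre- and post-change.
    """
    mean_shift = mean(after) - mean(before)
    peak_shift = max(after) - max(before)
    spread_ratio = (pstdev(after) + EPS) / (pstdev(before) + EPS)
    return mean_shift, peak_shift, spread_ratio


def looks_less_resilient(before, after, rel_threshold=0.5):
    """Hypothetical stand-in rule, not the paper's classifier.

    Flags the pair when the mean KPI under fault grows by more than
    `rel_threshold` relative to the pre-change mean.
    """
    mean_shift, _, _ = pair_features(before, after)
    return mean_shift / (abs(mean(before)) + EPS) > rel_threshold


# Made-up latency samples (ms) under an injected network delay.
before_change = [120, 130, 125, 128]
after_bad_change = [120, 310, 420, 390]   # degrades sharply under the fault
after_safe_change = [122, 129, 127, 126]  # behaves like the old version

flag_bad = looks_less_resilient(before_change, after_bad_change)
flag_safe = looks_less_resilient(before_change, after_safe_change)
```

In this toy setup `flag_bad` is true and `flag_safe` is false. A learned classifier over such pair features would replace the fixed threshold in practice.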