Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems

Shenglin Zhang, Yongxin Zhao, Xiao Xiong, Yongqian Sun, Xiaohui Nie, Jiacheng Zhang, Fenglai Wang, Xian Zheng, Yuzhi Zhang, Dan Pei
DOI: https://doi.org/10.1145/3663529.3663834
2024-01-01
Abstract: Timely localization of the root causes of gray failures is essential for maintaining the stability of a server OS. Previous intrusive gray failure localization methods usually require modifying application source code, which limits their practical deployment. In this paper, we propose GrayScope, a method for non-intrusively localizing the root causes of gray failures based on metric data from the server OS. Its core idea is to combine expert knowledge with causal learning techniques to capture more reliable inter-metric causal relationships. It then incorporates metric correlations and anomaly degrees to help identify potential root causes of gray failures. Additionally, it infers the gray failure propagation paths between metrics, providing interpretability and improving operators’ efficiency in mitigating gray failures. We evaluate GrayScope on 1241 injected gray failure cases and 135 cases from industrial experiments at Huawei. GrayScope achieves an AC@5 of 90% and an interpretability accuracy of 81%, significantly outperforming popular root cause localization methods. We have also made the code publicly available to facilitate further research.
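The abstract reports results as AC@5 without defining it. The sketch below assumes the definition commonly used in root cause localization work: AC@k is the fraction of failure cases whose true root cause appears among the top k ranked candidates. The function name `ac_at_k` and the metric names in the example are hypothetical and are not taken from GrayScope's published code.

```python
from typing import Sequence


def ac_at_k(ranked_lists: Sequence[Sequence[str]],
            true_causes: Sequence[str],
            k: int) -> float:
    """Fraction of cases whose true root cause is in the top-k candidates.

    Assumed definition of AC@k; the paper itself may differ in details.
    """
    if len(ranked_lists) != len(true_causes):
        raise ValueError("need exactly one true root cause per ranked list")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(ranked_lists)


# Hypothetical rankings for three failure cases (metric names are made up).
rankings = [
    ["cpu_iowait", "disk_await", "mem_used"],
    ["net_retrans", "cpu_iowait", "disk_await"],
    ["mem_used", "net_retrans", "disk_await"],
]
truths = ["disk_await", "net_retrans", "cpu_iowait"]

print(ac_at_k(rankings, truths, k=1))
print(ac_at_k(rankings, truths, k=2))
```

Under this reading, an AC@5 of 90% would mean that in nine out of ten evaluated cases the true root cause metric is among the first five candidates returned.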