G-Cause: Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures

Xinrui Jiang, Yang Zhang, Tingzhu Bi, Xiangzhuang Shen, Yu Zhang, Yicheng Pan, Meng Ma, Linlin Han, Feng Wang, Xian Liu, Ping Wang
DOI: https://doi.org/10.1109/icws62655.2024.00119
2024-01-01
Abstract: Hyperscale web service infrastructures are becoming increasingly complex and face a variety of threats, raising the demand for more sophisticated automated operations and diagnosis solutions. Existing anomaly root cause localization approaches often focus on service-level components without drilling down to the lower-level resources where services are deployed, which hinders fine-grained failure remediation. This paper introduces a challenging task called global diagnosis and addresses it by proposing a technique called G-Cause, which is applicable to both service-level and host-level root cause analysis scenarios. G-Cause builds a highly adaptive diagnostic framework based on the frequency-domain and time-domain characteristics of monitoring metrics, allowing it to handle global diagnosis requirements from app to host with minimal parameter adjustments. We deploy and validate our approach in two typical scenarios: homogeneous metric diagnosis from app to microservice, and heterogeneous metric diagnosis for various host resources. The results demonstrate that G-Cause outperforms state-of-the-art diagnosis algorithms while providing strong interpretability. Our approach helps operators understand the core mechanism of anomaly propagation and adjust their management strategies more effectively. With these strengths, G-Cause successfully serves our global product operations and also contributes to many other workflows.