Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds.

Zhe Wang, Huanwu Hu, Linghe Kong, Xinlei Kang, Teng Ma, Qiao Xiang, Jingxuan Li, Yang Lu, Zhuo Song, Peihao Yang, Jiejian Wu, Yong Yang, Tao Ma, Zheng Liu, Xianlong Zeng, Dennis Cai, Guihai Chen
2024-01-01
Abstract: Timely detection and diagnosis of application-network anomalies is a key challenge of operating large-scale production clouds. We reveal three practical issues in a cloud-native era. First, impact assessment of anomalies at a (micro)service level is absent from currently deployed monitoring systems. Ping systems are oblivious to the "actual weights" of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can be easily overlooked by probing systems under prevalent network jitters. Second, the efficiency of anomaly routing (to a blamed application/network team) is still low when multiple attribution teams are involved. Third, collecting fine-grained metrics at a (micro)service level incurs considerable computational/storage overheads, yet is indispensable for accurate impact assessment and anomaly routing. We introduce the application-network diagnosing (AND) system in Alibaba cloud. AND exploits the single metric of TCP retransmission (retx) to capture anomalies at (micro)service levels and correlates applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collecting tool to perform filtering/statistics on massive retxs at the (micro)service level, (2) a real-time detection procedure to extract anomalies from 'noisy' retxs with millions of time series, and (3) an anomaly routing model to delimit anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.