Proactive Telemetry in Large-Scale Multi-Tenant Cloud Overlay Networks

Shunmin Zhu, Jianyuan Lu, Biao Lyu, Tian Pan, Shize Zhang, Xiaoqing Sun, Chenhao Jia, Xin Cheng, Daxiang Kang, Yilong Lv, Fukun Yang, Xiaobo Xue, Xihui Yang, Zhiliang Wang, Jiahai Yang
DOI: https://doi.org/10.1109/tnet.2024.3381786
2024-01-01
Abstract: Public clouds now serve millions of tenants. To provide reliable services, cloud vendors need to monitor the health of the cloud network by building a telemetry system that detects network failures. While telemetry systems for physical networks have been extensively studied, telemetry systems for virtual networks remain under-researched. We find that, unlike in physical networks, building a virtual network telemetry system raises new challenges of feasibility, efficiency, and effectiveness. Specifically, we need to 1) protect the privacy of tenants and adapt to heterogeneous middleboxes at the data plane; 2) handle frequent virtual network topology updates and compress large-scale measurement paths for millions of tenants at the control plane; and 3) analyze telemetry results to locate network failures at the analysis plane. To address these challenges, we present Zoonet, a proactive virtual network telemetry system for multi-tenant clouds. At the data plane, Zoonet uses a host agent and arp-ping to protect tenants’ privacy and defines an elegant generalization of ping and traceroute that works on heterogeneous middleboxes. At the control plane, Zoonet batches topology updates and substantially prunes probing paths to reduce overhead. At the analysis plane, Zoonet reduces noise, aggregates alerts based on temporal and spatial correlation, and uses a hop-by-hop telemetry mode to locate failures. Zoonet has been deployed in Alibaba Cloud for over two years, covering tens of cloud regions and hundreds of thousands of servers. We have become increasingly reliant on Zoonet, as it reduces the personnel engaged in troubleshooting by 86%.
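
As a rough illustration of what aggregating alerts by temporal and spatial correlation can look like, the Python sketch below merges alerts that share a location and arrive close together in time. The abstract does not specify Zoonet's actual correlation logic, so everything here is an assumption made for illustration: the `Alert` type, the `location` key (for example a host or gateway identifier), the `aggregate_alerts` function, and the 60-second window.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    location: str     # hypothetical spatial key, e.g. a host or gateway ID
    message: str


def aggregate_alerts(alerts: List[Alert], window: float = 60.0) -> List[List[Alert]]:
    """Group alerts that share a location and arrive within `window`
    seconds of the previous alert in the same group.

    This is a simplified stand-in for temporal and spatial correlation,
    not the method used by Zoonet.
    """
    groups: List[List[Alert]] = []
    latest_group: Dict[str, int] = {}  # location -> index of its newest group

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = latest_group.get(alert.location)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window:
            # Same place, close in time: treat as the same incident.
            groups[idx].append(alert)
        else:
            # New place, or too long since the last alert there: new incident.
            groups.append([alert])
            latest_group[alert.location] = len(groups) - 1

    return groups


if __name__ == "__main__":
    sample = [
        Alert(0.0, "gw-1", "packet loss"),
        Alert(30.0, "gw-1", "packet loss"),
        Alert(40.0, "host-7", "timeout"),
        Alert(200.0, "gw-1", "packet loss"),
    ]
    for group in aggregate_alerts(sample):
        print(group[0].location, len(group))
```

In this sketch, the two early `gw-1` alerts collapse into one incident, while the later `gw-1` alert starts a new one because it falls outside the window. A production system would likely correlate across related locations (for example, hosts behind the same gateway) rather than only on exact matches.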