Zoonet: a proactive telemetry system for large-scale cloud networks.

Shunmin Zhu, Jianyuan Lu, Biao Lyu, Tian Pan, Chenhao Jia, Xin Cheng, Daxiang Kang, Yilong Lv, Fukun Yang, Xiaobo Xue, Zhiliang Wang, Jiahai Yang
DOI: https://doi.org/10.1145/3555050.3569116
2022-01-01
Abstract: We present Zoonet, a proactive virtual network telemetry system for multi-tenant clouds. The requirements are to (1) cover hyper-scale virtual networks with millions of tenants and millions of VMs for top tenants; (2) handle frequent virtual topology changes caused by tenants' configuration through flexible APIs; (3) adapt to heterogeneous middleboxes along the probing paths; (4) achieve VM-to-VM telemetry without breaking tenant privacy; and (5) differentiate virtual and physical network problems. We argue that existing physical network telemetry solutions fail to satisfy our needs due to either incomplete telemetry coverage or outrageous telemetry overhead. Zoonet sets an ambitious goal of providing VM-to-VM hop-by-hop telemetry for each tenant, which is achieved with self-developed, customizable middleboxes built over hundreds of person-months of close team collaboration. At the data plane, Zoonet defines an elegant generalization of ping and traceroute that is made to work on multi-tenant clouds with heterogeneous middleboxes. At the control plane, Zoonet conducts substantial probing path pruning and update batch processing to lessen the overhead. Zoonet has been deployed in Alibaba Cloud for over two years, covering tens of cloud regions and hundreds of thousands of servers. We have become increasingly reliant on Zoonet, as it reduces the personnel engaged in troubleshooting by 86%.