MegaTE: Extending WAN Traffic Engineering to Millions of Endpoints in Virtualized Cloud

Congcong Miao, Zhizhen Zhong, Yunming Xiao, Feng Yang, Senkuo Zhang, Yinan Jiang, Zizhuo Bai, Chaodong Lu, Jingyi Geng, Zekun He, Yachen Wang, Xianneng Zou, Chuanchuan Yang
DOI: https://doi.org/10.1145/3651890.3672242
2024-01-01
Abstract: In today's virtualized cloud, containers and virtual machines (VMs) are the prevailing methods for deploying applications with different tenant requirements. However, these requirements are at odds with the resource allocation capabilities of conventional networking stacks in wide-area networks (WANs). In particular, existing WAN traffic engineering (TE) systems operate at the granularity of aggregated traffic flows and are not designed to cater to each individual flow. In this paper, we advocate for a radical new approach that extends TE systems to involve millions of virtual instance endpoints. We propose and implement a first-of-its-kind system, called MegaTE, to satisfy the needs of each fine-grained traffic flow at the virtual instance level. At the core of the MegaTE system is a paradigm shift from top-down centralized control to bottom-up asynchronous query in the TE control loop, combined with eBPF-based segment routing on the data plane and TE optimization contraction on the control plane. We evaluate MegaTE using flow-level simulations with production traffic traces. Our results show that MegaTE supports 20× more endpoints than prior work with a similar algorithm run time. MegaTE has been adopted by large-scale public cloud providers. Notably, Tencent has deployed MegaTE in its cloud WAN since December 2022. Our production analysis shows that MegaTE reduces the packet latency of real-time applications by up to 51%.