Towards Automatic Root Cause Diagnosis of Persistent Packet Loss in Cloud Overlay Network
Chongrong Fang, Haoyu Liu, Mao Miao, Jie Ye, Lei Wang, Wansheng Zhang, Daxiang Kang, Biao Lyu, Shunmin Zhu, Peng Cheng, Jiming Chen
DOI: https://doi.org/10.1109/tnet.2021.3137557
2022-01-01
IEEE/ACM Transactions on Networking
Abstract: Persistent packet loss in cloud-scale overlay networks severely degrades the tenant experience, and cloud providers are keen to diagnose such problems efficiently. However, existing work is either designed for the physical network or unable to pinpoint the concrete cause of packet loss. We propose to record and analyze the on-site forwarding condition of packets during packet-level tracing. Achieving this in a cloud-scale overlay network is challenging because of its high network complexity, multi-tenant nature, and diversity of root causes. To address these challenges, we present VTrace, an automatic diagnostic system for persistent packet loss in cloud-scale overlay networks. Exploiting the "fast path-slow path" structure of virtual forwarding devices (VFDs), e.g., vSwitches, VTrace installs several "coloring-matching-logging" rules in VFDs to selectively track target packets and inspect them in depth. The forwarding details at each hop are logged and then assembled for analysis using an efficient path reconstruction scheme. Experiments demonstrate VTrace's low overhead and quick response. Moreover, based on the "coloring-matching-counting" idea, VTrace can be easily extended to VTrace-stats, which identifies the culprit device for transient packet loss. We share our experience of how VTrace and VTrace-stats have worked efficiently over years of deployment in Alibaba Cloud.
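
The "coloring-matching-logging" idea and the per-hop path reconstruction can be illustrated with a small simulation. This is only a sketch under stated assumptions, not VTrace's implementation: the abstract does not specify the mark encoding, rule format, or log schema, so COLOR_BIT, Packet, LogEntry, VFD, send, and reconstruct are all hypothetical names, and a flow is identified here by a simple (src, dst) pair.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

COLOR_BIT = 0x1  # hypothetical mark bit; the real encoding is not given in the abstract


@dataclass
class Packet:
    pkt_id: int
    src: str
    dst: str
    mark: int = 0


@dataclass
class LogEntry:
    pkt_id: int
    hop: int
    device: str
    action: str                   # "forward" or "drop"
    reason: Optional[str] = None  # filled in only when the device drops


@dataclass
class VFD:
    """Toy virtual forwarding device (assumed model, not a real vSwitch)."""
    name: str
    is_ingress: bool = False
    drop_reason: Optional[str] = None  # set to simulate a faulty device

    def process(self, pkt: Packet, target: Tuple[str, str], hop: int,
                logs: List[LogEntry]) -> bool:
        """Return True if the packet is forwarded, False if it is dropped."""
        # Coloring: the ingress device marks packets of the target flow.
        if self.is_ingress and (pkt.src, pkt.dst) == target:
            pkt.mark |= COLOR_BIT
        dropped = self.drop_reason is not None
        # Matching: only colored packets take the logging branch;
        # all other traffic is handled without any logging.
        if pkt.mark & COLOR_BIT:
            # Logging: record what this hop did with the packet.
            logs.append(LogEntry(pkt.pkt_id, hop, self.name,
                                 "drop" if dropped else "forward",
                                 self.drop_reason))
        return not dropped


def send(pkt: Packet, path: List[VFD], target: Tuple[str, str],
         logs: List[LogEntry]) -> bool:
    """Push a packet through the devices in order; stop at the first drop."""
    for hop, dev in enumerate(path):
        if not dev.process(pkt, target, hop, logs):
            return False
    return True


def reconstruct(logs: List[LogEntry]) -> Dict[int, dict]:
    """Group logs per packet, order them by hop, and report the drop point."""
    per_packet: Dict[int, List[LogEntry]] = {}
    for entry in logs:
        per_packet.setdefault(entry.pkt_id, []).append(entry)
    result = {}
    for pkt_id, entries in per_packet.items():
        entries.sort(key=lambda e: e.hop)
        last = entries[-1]
        result[pkt_id] = {
            "path": [e.device for e in entries],
            "dropped_at": last.device if last.action == "drop" else None,
            "reason": last.reason,
        }
    return result
```

In this toy model, sending one packet of the target flow and one unrelated packet through the same faulty device produces log entries only for the target packet, and reconstruct names the device where its path ends. Replacing the logs.append call with a per-device counter increment would give a rough analogue of the "coloring-matching-counting" variant.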