DCQCN+: Taming Large-Scale Incast Congestion in RDMA over Ethernet Networks

Yixiao Gao, Yuchen Yang, Tian Chen, Jiaqi Zheng, Bing Mao, Guihai Chen
DOI: https://doi.org/10.1109/icnp.2018.00021
2018-01-01
Abstract: Remote Direct Memory Access (RDMA) is gaining popularity in datacenter networks. The state-of-the-art congestion control scheme is DCQCN. However, DCQCN performs poorly under large-scale incast communication. When probing for available bandwidth, DCQCN increases its rate with a fixed period and fixed steps, and this scheme does not scale. Our key insight is that senders should be aware of the scale of each incast so that they can adjust their aggressiveness accordingly. There are two challenges: the scale of congestion is not easy to estimate, and the control scheme must be designed cautiously. In this paper, we propose DCQCN+ to improve performance under large-scale incast congestion in RDMA networks. DCQCN+ adapts its rate control mechanisms to different scenarios. DCQCN+ can handle incast congestion of at least 2,000 flows in both simulation and testbed experiments, a scale 10 times larger than that of DCQCN in simulation and 4 times larger in the testbed. DCQCN+ also achieves 10 times lower latency.
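The scalability argument in the abstract can be illustrated with simple arithmetic: if every sender adds the same fixed step in each increase period, the extra load offered to the bottleneck grows linearly with the number of flows. The sketch below is not the DCQCN+ algorithm. The step size, the function names, and the 1/N scaling rule are all assumptions chosen only to show why awareness of incast scale helps.

```python
# Illustrative only: the constant below is an assumption, not a value from the paper.
FIXED_STEP_GBPS = 0.04  # assumed per-flow additive-increase step


def aggregate_increase(num_flows: int, per_flow_step: float) -> float:
    """Total extra load offered to the bottleneck in one increase period."""
    return num_flows * per_flow_step


def scaled_step(num_flows: int, base_step: float) -> float:
    """Hypothetical scale-aware step: shrink the step as the incast grows.

    This 1/N rule is a stand-in for 'adjusting aggressiveness'; the paper's
    actual mechanism is not described in the abstract.
    """
    return base_step / max(num_flows, 1)


if __name__ == "__main__":
    for n in (10, 200, 2000):
        fixed = aggregate_increase(n, FIXED_STEP_GBPS)
        aware = aggregate_increase(n, scaled_step(n, FIXED_STEP_GBPS))
        print(f"{n:5d} flows: fixed step adds {fixed:.2f} Gbps, "
              f"scale-aware step adds {aware:.2f} Gbps")
```

With a fixed step the aggregate probe grows with the flow count, while under the assumed scale-aware rule it stays roughly constant regardless of incast size.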