Analyzing and Optimizing Packet Corruption in RDMA Network

Yi-Xiao Gao, Chen Tian, Wei Chen, Duo-Xing Li, Jian Yan, Yuan-Yuan Gong, Bing-Quan Wang, Tao Wu, Lei Han, Fa-Zhi Qi, Shan Zeng, Wan-Chun Dou, Gui-Hai Chen
DOI: https://doi.org/10.1007/s11390-022-2123-8
IF: 1.871
2022-01-01
Journal of Computer Science and Technology
Abstract: Remote direct memory access (RDMA) has become one of the state-of-the-art high-performance network technologies in datacenters. The reliable transport of RDMA is designed for a lossless underlying network and cannot endure a high packet loss rate. However, besides switch buffer overflow, there is another kind of packet loss in RDMA networks, namely packet corruption, which has not been discussed in depth. Packet corruption incurs long application tail latency by causing timeout retransmissions. The challenges in solving packet corruption in RDMA networks include: 1) packet corruption is inevitable whatever remedial mechanisms are applied, and 2) RDMA hardware is not programmable. This paper proposes designs that can guarantee the expected tail latency of applications in the presence of packet corruption. The key idea is to control the probability of timeout events caused by packet corruption by transforming timeout retransmissions into out-of-order retransmissions. We build a probabilistic model to estimate the occurrence probabilities and real effects of the corruption patterns. We implement these two mechanisms with the help of programmable switches and the zero-byte message RDMA feature. We build an ns-3 simulation and implement the optimization mechanisms on our testbed. The simulation and testbed experiments show that the optimizations can decrease the flow completion time by several orders of magnitude with less than 3% bandwidth cost at different packet corruption rates.