POSTER: CAVER: Enhancing RDMA Load Balancing by Hunting Less-Congested Paths

Haotian Deng, Yuan Yang, Menghao Zhang, Mingwei Xu
DOI: https://doi.org/10.1145/3672202.3673729
2024-01-01
Abstract: Remote Direct Memory Access (RDMA) has become a prevailing technology for modern data centers (DCs) to achieve high throughput and low latency [8]. Many DCs have adopted RDMA over Converged Ethernet v2 (RoCEv2) [3] to provide superior performance for emerging application paradigms such as cloud storage [4] and distributed deep learning [11]. Network load balancing (LB) plays a critical role in optimizing DC network performance. A large body of literature studies LB for traditional DCs [1, 2, 6, 7, 9], as it is well known that the widely used ECMP [12] has limited LB performance. However, RDMA operates differently from traditional TCP-based data transmission, and existing studies for traditional DCs do not fit RDMA-enabled DCs well. For example, RDMA is very sensitive to out-of-order packets, which may lead to significant throughput degradation, and RDMA flows can hardly be partitioned into flowlets. Thus, existing packet-level LB approaches [6, 7] and flowlet-level LB approaches [2] perform poorly in RDMA-enabled DCs.