Enabling Distributed and Optimal RDMA Resource Sharing in Large-Scale Data Center Networks: Modeling, Analysis, and Implementation

Dian Shen,Junzhou Luo,Fang Dong,Xiaolin Guo,Ciyuan Chen,Kai Wang,John C. S. Lui
DOI: https://doi.org/10.1109/TNET.2023.3263562
2023-01-01
Abstract:Remote Direct Memory Access (RDMA) suffers from unfairness issues and performance degradation when multiple applications share RDMA network resources. Hence, an efficient resource scheduling mechanism is urged to optimally allocates RDMA resources among applications. However, traditional Network Utility Maximization (NUM) based solutions are inadequate for RDMA due to three challenges: 1) The standard NUM-oriented algorithm cannot deal with coupling variables introduced by multiple dependent RDMA operations; 2) The stringent constraint of RDMA on-board resources complicates the standard NUM by bringing extra optimization dimensions; 3) Naively applying traditional algorithms for NUM suffers from scalability issues in solving a large-scale RDMA resource scheduling problem. In this paper, we present how to optimally share the RDMA resources in large-scale data center networks with a distributed manner. First, we propose Distributed RDMA NUM (DRUM) to model the RDMA resource scheduling problem as a new variation of the NUM problem. Second, we present distributed algorithms to efficiently solve the large-scale, interdependent RDMA resource sharing problem for different RDMA use cases. Through theoretical analysis, the convergence and parallelism of proposed algorithms are guaranteed. Finally, we implement the algorithms as a kernel-level indirection module in the real-world RDMA environment, so as to provide end-to-end resource sharing and performance guarantee. Through extensive evaluations by large-scale simulations and testbed experiments, we show that our method significantly improves applications’ performance under resource contention, achieving <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.7-3.1\times $ </tex-math></inline-formula> higher throughput, and in a dynamic context, the largest performance improvement reaches 98.1% and 64.1% in terms of latency and throughput, respectively.
What problem does this paper attempt to address?