LubeRDMA: A Fail-safe Mechanism of RDMA

Shengkai Lin, Qinwei Yang, Zengyin Yang, Yuchuan Wang, Shizhen Zhao
DOI: https://doi.org/10.1145/3663408.3663411
2024-01-01
Abstract: Recent years have witnessed wide adoption of Remote Direct Memory Access (RDMA) to accelerate distributed systems. As the scale of distributed applications keeps increasing, network failures become more prominent. Although some link and switch failures can be circumvented by in-network rerouting, failures such as NIC failures are still fatal in RDMA networks and may cause the entire system to fail. To address this issue, we propose a fail-safe mechanism for RDMA called LubeRDMA. The core idea is to leverage multiple RDMA NICs on a server and treat them as backups for each other. We introduce a vRDMA model that presents a failure-resilient RDMA network to the application. With this model, we achieve RDMA fault tolerance and recovery. In our evaluation, we demonstrate that LubeRDMA efficiently handles RDMA failures while having minimal impact on RDMA performance.
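The backup idea in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not LubeRDMA's design or API: the names SimulatedNic, VirtualEndpoint, and NicFailure are hypothetical, no real RDMA verbs are used, and the abstract does not say how the vRDMA model detects failures or migrates in-flight work.

```python
class NicFailure(Exception):
    """Raised by a simulated NIC that cannot send (hypothetical)."""


class SimulatedNic:
    """Stand-in for one RDMA NIC; records payloads instead of sending them."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.sent = []

    def send(self, payload):
        if not self.healthy:
            raise NicFailure(self.name)
        self.sent.append(payload)


class VirtualEndpoint:
    """Hypothetical application-facing endpoint that hides NIC failures."""

    def __init__(self, nics):
        if not nics:
            raise ValueError("at least one NIC is required")
        self.nics = list(nics)
        self.active = 0  # index of the NIC currently in use

    def send(self, payload):
        # Try the active NIC first, then each backup in order.
        for offset in range(len(self.nics)):
            index = (self.active + offset) % len(self.nics)
            try:
                self.nics[index].send(payload)
            except NicFailure:
                continue  # this NIC is down; fall through to the next one
            self.active = index  # stick with the NIC that worked
            return self.nics[index].name
        raise NicFailure("all NICs failed")
```

In this sketch the caller only ever talks to VirtualEndpoint, so a single NIC failure shows up as a change in which NIC carried the payload rather than as an error. An error reaches the application only when every NIC is down.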