Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL

Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu
DOI: https://doi.org/10.1145/3634916
IF: 1.444
2024-01-19
ACM Transactions on Architecture and Code Optimization
Abstract: Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches have physical distance limitation and cannot be deployed across racks. In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. The significant feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the challenges of the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides a global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance by using micro-benchmarks and running a key-value store with YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp can scale well with the increasing number of nodes without compromising performance.
computer science, theory & methods, hardware & architecture
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several key issues present in memory disaggregation systems in modern data centers. Specifically:

1. **Limitations of Existing RDMA-based Memory Disaggregation Systems**:
   - **High Latency**: Although existing RDMA networks can achieve microsecond-level latency (1.5 to 3 μs), there is still a significant gap compared to the nanosecond-level latency of DRAM (80 to 140 ns). This latency leads to performance bottlenecks when accessing memory pools.
   - **Additional Overhead**: Due to the lack of native support for memory semantics, RDMA requires intrusive code modifications to the original system and introduces interrupt overhead. Current RDMA-based memory disaggregation methods include page-based and object-based approaches. The former involves page fault handling and read/write amplification, while the latter requires custom interface changes and source code-level modifications.
2. **Limitations of Existing CXL-based Memory Disaggregation Systems**:
   - **Physical Distance Limitation**: Current CXL technology can only be deployed within a rack, and even the latest CXL 3.0 specification struggles to span across racks, limiting its scalability.
   - **Cost Issues**: Replacing all RDMA hardware in data centers with CXL hardware is very costly, especially for large-scale clusters. Additionally, the lack of commercially available large-scale production CXL hardware and its supporting infrastructure means that CXL research often relies on custom FPGA prototypes or CPU-less NUMA node emulation.

To address the above issues, the paper proposes the Rcmp system, a new hybrid memory disaggregation architecture that combines the advantages of RDMA and CXL technologies. Rcmp builds small CXL-based memory pools within each rack and uses RDMA to connect these memory pools to form a larger memory pool, thereby overcoming the distance limitation of CXL.
Rcmp designs several optimizations to address the granularity, communication, and performance mismatches between RDMA and CXL, including:

- A global page management mechanism that supports fine-grained data access;
- An efficient cross-rack communication mechanism to avoid communication blocking;
- Hot-page identification and swapping strategies to reduce cross-rack access;
- A high-performance RDMA-optimized remote procedure call (RPC) framework to accelerate cross-rack RDMA transfers.

Through these optimizations, Rcmp achieves lower latency (a 3 to 8 times reduction) and higher throughput (a 2 to 4 times improvement) compared to traditional RDMA-based memory disaggregation systems, and it scales well.
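
To make the page-management and hot-page ideas more concrete, the following is a minimal Python sketch of how a rack might serve local pages over CXL, reach remote pages over RDMA, and pull a frequently accessed remote page into its local pool. It is an illustration under stated assumptions, not the paper's implementation: the names `RackView`, `split_address`, `PAGE_SIZE`, and `HOT_THRESHOLD`, the page size, and the simple counter-with-threshold policy are all hypothetical.

```python
from collections import defaultdict

PAGE_SIZE = 2 * 1024 * 1024  # hypothetical page size; not taken from the paper
HOT_THRESHOLD = 3            # hypothetical: remote accesses before a page counts as hot


def split_address(global_addr: int) -> tuple[int, int]:
    """Split a global address into (page_id, offset_in_page).

    Pages are the unit of management, while the offset allows
    fine-grained access inside a page.
    """
    return global_addr // PAGE_SIZE, global_addr % PAGE_SIZE


class RackView:
    """Toy model of one rack: the pages held in its local pool and a
    per-page count of accesses to pages held by other racks."""

    def __init__(self, local_pages: set[int]):
        self.local_pages = set(local_pages)
        self.remote_hits = defaultdict(int)

    def access(self, global_addr: int) -> str:
        """Return which path would serve this access in the toy model."""
        page_id, _offset = split_address(global_addr)
        if page_id in self.local_pages:
            return "cxl"  # page is in the rack-local pool
        self.remote_hits[page_id] += 1
        if self.remote_hits[page_id] >= HOT_THRESHOLD:
            # Page is hot: bring it into the local pool. A real system would
            # also have to hand a page back or update ownership; omitted here.
            self.local_pages.add(page_id)
            del self.remote_hits[page_id]
            return "rdma+swap"
        return "rdma"  # cold remote page: access it in place over RDMA


rack = RackView(local_pages={0})
remote_addr = PAGE_SIZE + 5  # falls in page 1, which this rack does not hold
print([rack.access(remote_addr) for _ in range(4)])
```

In this sketch the first accesses to a remote page go over RDMA, the access that crosses the threshold triggers the swap, and later accesses hit the local pool. The threshold trades off swap cost against repeated cross-rack traffic; the value used here is arbitrary.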