Abstract: Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches have a physical distance limitation and cannot be deployed across racks. In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. The significant feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the challenges of the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides a global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance by using micro-benchmarks and running a key-value store with YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp can scale well with the increasing number of nodes without compromising performance.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key issues present in memory disaggregation systems in modern data centers. Specifically:

1. **Limitations of Existing RDMA-based Memory Disaggregation Systems**:
   - **High Latency**: Although existing RDMA networks can achieve microsecond-level latency (1.5~3 μs), there is still a significant gap compared to the nanosecond-level latency of DRAM memory (80~140 ns). This latency leads to performance bottlenecks when accessing memory pools.
   - **Additional Overhead**: Due to the lack of native support for memory semantics, RDMA requires intrusive code modifications to the original system and introduces interrupt overhead. Current RDMA-based memory disaggregation methods include page-based and object-based approaches. The former involves page fault handling and read/write amplification, while the latter requires custom interface changes and source code-level modifications.
2. **Limitations of Existing CXL-based Memory Disaggregation Systems**:
   - **Physical Distance Limitation**: Current CXL technology can only be deployed within a rack, and even the latest CXL 3.0 specification struggles to span across racks, limiting its scalability.
   - **Cost Issues**: Replacing all RDMA hardware in data centers with CXL hardware is very costly, especially for large-scale clusters. Additionally, the lack of commercially available large-scale production CXL hardware and its supporting infrastructure means that CXL research often relies on custom FPGA prototypes or CPU-less NUMA node emulation.

To address the above issues, the paper proposes the Rcmp system, a new hybrid memory disaggregation architecture that combines the advantages of RDMA and CXL technologies. Rcmp builds small CXL-based memory pools within each rack and uses RDMA to connect these memory pools to form a larger memory pool, thereby overcoming the distance limitation of CXL.

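
The hybrid layout described above (CXL inside a rack, RDMA between racks) can be illustrated with a small routing sketch. This is not Rcmp's implementation: the `PageDirectory` class, the `route` method, and the 2 MiB `PAGE_SIZE` are hypothetical names and values chosen for illustration, since the summary does not specify them. The sketch only shows how a global address might be split into a page ID and an offset, and how the page's home rack could decide which interconnect serves the access.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

PAGE_SIZE = 2 * 1024 * 1024  # assumed page size; the summary does not state one


@dataclass
class PageDirectory:
    """Hypothetical global directory mapping page IDs to the rack that holds them."""
    home_rack: Dict[int, int] = field(default_factory=dict)

    def split(self, global_addr: int) -> Tuple[int, int]:
        """Split a global address into (page_id, offset_within_page)."""
        return divmod(global_addr, PAGE_SIZE)

    def route(self, global_addr: int, local_rack: int) -> str:
        """Return which interconnect would serve an access issued from local_rack."""
        page_id, _offset = self.split(global_addr)
        rack = self.home_rack[page_id]
        # Same rack: the page sits in the local CXL pool; otherwise go over RDMA.
        return "cxl" if rack == local_rack else "rdma"


directory = PageDirectory(home_rack={0: 0, 1: 1})
print(directory.route(64, local_rack=0))              # page 0 lives in rack 0
print(directory.route(PAGE_SIZE + 64, local_rack=0))  # page 1 lives in rack 1
```

Keeping the offset separate from the page ID is what would allow placement to be managed per page while individual accesses remain fine-grained.
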
Rcmp designs several optimizations to address the granularity, communication, and performance mismatches between RDMA and CXL, including:

- A global page management mechanism that supports fine-grained data access;
- An efficient cross-rack communication mechanism to avoid communication blocking;
- Hot page identification and exchange strategies to reduce cross-rack access;
- A high-performance RDMA-optimized remote procedure call (RPC) framework to accelerate cross-rack RDMA transmission.

Through these optimizations, Rcmp achieves lower latency (3 to 8 times reduction) and higher throughput (2 to 4 times improvement) compared to traditional RDMA-based memory disaggregation systems, and it has good scalability.
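
The summary does not describe how Rcmp identifies hot pages, so the following is only a generic sketch of the idea: count cross-rack accesses per page and flag a page for swapping into the local rack once it crosses a threshold. The `HotPageTracker` class, its `threshold` parameter, and the reset-after-swap behavior are all assumptions made for illustration, not the paper's policy.

```python
from collections import Counter


class HotPageTracker:
    """Counts cross-rack accesses per page and flags pages that cross a threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold    # hypothetical tuning knob
        self.remote_hits = Counter()  # page_id -> cross-rack accesses seen so far

    def record_remote_access(self, page_id: int) -> bool:
        """Record one cross-rack access; return True when the page should be swapped in."""
        self.remote_hits[page_id] += 1
        if self.remote_hits[page_id] >= self.threshold:
            # Assumed behavior: the page becomes local after the swap, so reset its count.
            del self.remote_hits[page_id]
            return True
        return False


tracker = HotPageTracker(threshold=3)
decisions = [tracker.record_remote_access(7) for _ in range(3)]
print(decisions)  # [False, False, True]
```

Once a page is swapped into the local rack, later accesses to it would no longer need cross-rack RDMA, which is the reduction the hot-page strategy in the list above is aiming for.
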

Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL
