Abstract: PCIe devices such as SSDs and GPUs are pivotal in modern data centers, and their value is set to grow with the emergence of AI and large models. However, these devices face a shortage of onboard DRAM: limited internal space prevents them from accommodating sufficient DRAM modules alongside flash or GPU processing chips. Current solutions either curb device-internal memory usage or supplement it with slower non-DRAM media, and they prove inadequate or compromise performance. This paper introduces the Linked Memory Buffer (LMB), a scalable solution that uses CXL memory expanders to address the shortage of device onboard memory. The low latency of CXL enables LMB to use emerging DRAM memory expanders to efficiently supplement device onboard DRAM with minimal impact on performance.

What problem does this paper attempt to address?

The main problem this paper addresses is the shortage of internal DRAM in PCIe devices such as SSDs and GPUs. With the rise of AI and large-scale models, these devices are becoming increasingly valuable in modern data centers. However, their limited internal space cannot accommodate enough DRAM modules, which restricts further improvement of device performance.

### Specific problem description:

1. **Internal space limitations**:
   - Because internal space is limited, PCIe devices (such as SSDs, GPUs, and DPUs) cannot accommodate enough DRAM modules. For example, enterprise SSDs are typically configured with DRAM equal to only 0.1% of their storage capacity, and mainstream DRAM technology limits an SSD's internal memory to 32GB, even though QLC technology can provide more than 32TB of storage in the U.2 form factor.
   - DRAM must be placed close to the SSD controller, just as server memory sits close to the CPU socket, which further restricts DRAM expansion.
2. **Insufficiencies of existing solutions**:
   - Current solutions either restrict the use of device-internal memory or supplement it with slower non-DRAM media. These methods are either ineffective or sacrifice performance.
   - For example, DFTL keeps the L2P index in flash memory instead of DRAM, but its performance is limited because each access needs two reads (one for the index and one for the data), and it is only suitable for mobile devices.
   - Unified Virtual Memory (UVM) can partially relieve the shortage of GPU memory, but obvious performance bottlenecks remain when training on large-scale datasets.
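The 0.1% ratio and the DFTL read penalty described above can be checked with a short calculation. This is a minimal sketch under stated assumptions: it assumes a page-level L2P table with one 4-byte entry per 4 KiB logical page (a common design, but not something the summary states), uses decimal units for capacity, and the function names are illustrative only.

```python
# Assumption (not from the source): page-level L2P mapping with
# one 4-byte entry per 4 KiB logical page.
ENTRY_BYTES = 4
PAGE_BYTES = 4096

TB = 10**12
GB = 10**9


def l2p_dram_bytes(capacity_bytes: int) -> int:
    """DRAM needed to hold the whole L2P table for a given SSD capacity."""
    pages = capacity_bytes // PAGE_BYTES
    return pages * ENTRY_BYTES


def flash_reads_per_host_read(index_in_dram: bool) -> int:
    """Flash reads needed to serve one host read.

    With the index in DRAM only the data page is read from flash.
    With a DFTL-style index in flash, a mapping miss costs one extra
    flash read for the index before the data read.
    """
    return 1 if index_in_dram else 2


if __name__ == "__main__":
    capacity = 32 * TB
    need = l2p_dram_bytes(capacity)
    print(f"L2P table for 32 TB: {need / GB:.2f} GB "
          f"({need / capacity:.4%} of capacity)")
    print("flash reads, index in DRAM :", flash_reads_per_host_read(True))
    print("flash reads, index in flash:", flash_reads_per_host_read(False))
```

Under these assumptions a 32TB drive needs roughly 31GB of mapping DRAM, which is consistent with the 0.1% figure and shows why a 32GB DRAM ceiling becomes a constraint at that capacity.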
### Solution proposed in the paper:

The paper introduces the **Linked Memory Buffer (LMB)**, a scalable solution based on CXL (Compute Express Link) memory expanders. It aims to supplement the internal DRAM of devices efficiently over the low-latency CXL protocol while minimizing the impact on performance.

- **Core idea of LMB**: Through the CXL protocol, LMB dynamically expands the memory of PCIe devices and lets memory resources be shared between CXL and PCIe devices, using either efficient point-to-point access or host forwarding.
- **Specific implementation**: The LMB framework includes a CXL memory expander, a Fabric Manager (FM), and kernel modules. It provides a unified memory allocation interface so that the NVMe driver and the CUDA unified memory driver can access the CXL memory expander directly and efficiently.

### Conclusion:

The LMB framework aims to solve the shortage of internal DRAM in PCIe devices. It uses CXL to expand device memory while keeping memory access high-bandwidth and low-latency, thereby improving overall device performance.
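To make the idea of a unified allocation interface more concrete, here is a toy sketch. It is not the paper's API: the class names `Pool` and `LinkedBufferAllocator`, the `alloc` method, and the "onboard first, then spill to the expander" policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A fixed-size memory pool tracked by a simple byte counter."""
    name: str
    capacity: int
    used: int = 0

    def try_alloc(self, size: int) -> bool:
        """Reserve `size` bytes if they fit; report whether it succeeded."""
        if size <= 0 or self.used + size > self.capacity:
            return False
        self.used += size
        return True


@dataclass
class LinkedBufferAllocator:
    """Hypothetical single entry point that hides where memory comes from.

    Assumed policy: serve from device onboard DRAM first, then spill to
    the CXL memory expander once onboard DRAM is exhausted.
    """
    onboard: Pool
    expander: Pool

    def alloc(self, size: int) -> str:
        """Return the name of the pool that satisfied the request."""
        for pool in (self.onboard, self.expander):
            if pool.try_alloc(size):
                return pool.name
        raise MemoryError(f"no pool can satisfy {size} bytes")


if __name__ == "__main__":
    allocator = LinkedBufferAllocator(
        onboard=Pool("onboard_dram", capacity=8),
        expander=Pool("cxl_expander", capacity=16),
    )
    for request in (6, 6, 2):
        print(request, "->", allocator.alloc(request))
```

In this sketch the caller never picks a tier; a device driver would ask for memory through one call and the allocator would decide whether onboard DRAM or the expander backs it.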

LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer
