LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer

Jiapin Wang, Xiangping Zhang, Chenlei Tang, Xiang Chen, Tao Lu
2024-06-04
Abstract: PCIe devices, such as SSDs and GPUs, are pivotal in modern data centers, and their value is set to grow amidst the emergence of AI and large models. However, these devices face onboard DRAM shortage issue due to internal space limitation, preventing accommodation of sufficient DRAM modules alongside flash or GPU processing chips. Current solutions either curb device-internal memory usage or supplement slower non-DRAM mediums, prove inadequate or performance-compromising. This paper introduces the Linked Memory Buffer (LMB), a scalable solution utilizing the CXL memory expander to tackle device onboard memory deficiencies. The low-latency of CXL enables LMB to utilize emerging DRAM memory expander to efficiently supplement device onboard DRAM with minimal impact on performance.
Hardware Architecture
What problem does this paper attempt to address?
The main problem this paper addresses is the shortage of onboard DRAM in PCIe devices such as SSDs and GPUs. With the rise of AI and large-scale models, these devices are becoming increasingly valuable in modern data centers. However, their limited internal space cannot accommodate enough DRAM modules, which restricts further improvement of device performance.

### Specific problem description:

1. **Internal space limitations**:
   - Because internal space is limited, PCIe devices (such as SSDs, GPUs, and DPUs) cannot accommodate enough DRAM modules. For example, the standard DRAM configuration of enterprise-level SSDs is sized at only 0.1% of the storage capacity, and mainstream DRAM technology limits the onboard memory of SSDs to 32GB, even though QLC technology can provide more than 32TB of storage in the U.2 form factor.
   - DRAM must be placed close to the SSD controller, just as server memory sits close to the CPU socket, which further restricts DRAM expansion.
2. **Insufficiencies of existing solutions**:
   - Current solutions either curb the device's internal memory usage or supplement it with slower non-DRAM media. These methods are either inadequate or sacrifice performance.
   - For example, DFTL keeps the logical-to-physical (L2P) index in flash memory instead of DRAM, but its performance is limited because a lookup can require two reads (one for the index and one for the data), and it is only suitable for mobile devices.
   - Unified Virtual Memory (UVM) can partially relieve the shortage of GPU memory, but obvious performance bottlenecks remain when training on large-scale datasets.

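The 0.1% figure above can be reproduced with a small back-of-the-envelope calculation. The sketch below assumes a flat page-level L2P table with a 4 KiB mapping granularity and 4-byte entries; these two parameters are illustrative assumptions and are not taken from the paper.

```python
# Back-of-the-envelope sizing of a flat page-level L2P table.
# Assumptions (illustrative, not from the paper):
#   - one mapping entry per 4 KiB logical page
#   - 4 bytes per entry (a simplification: 4-byte entries can only
#     address 2**32 pages, i.e. 16 TiB at 4 KiB granularity, so larger
#     drives need wider entries or a coarser mapping)

KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

PAGE_SIZE = 4 * KIB   # assumed mapping granularity
ENTRY_SIZE = 4        # assumed bytes per L2P entry


def l2p_table_bytes(capacity_bytes, page_size=PAGE_SIZE, entry_size=ENTRY_SIZE):
    """Return the DRAM needed to hold one L2P entry per logical page."""
    pages = capacity_bytes // page_size
    return pages * entry_size


def dram_ratio(page_size=PAGE_SIZE, entry_size=ENTRY_SIZE):
    """Return the table size as a fraction of the flash capacity."""
    return entry_size / page_size


if __name__ == "__main__":
    for capacity_tib in (1, 8, 32):
        need = l2p_table_bytes(capacity_tib * TIB)
        print(f"{capacity_tib} TiB flash -> {need / GIB:.0f} GiB of L2P table")
    print(f"ratio: {dram_ratio():.4%}")
```

Under these assumptions the ratio is 4 / 4096, roughly 0.1%, so a 32 TiB drive would want about 32 GiB of mapping DRAM, which is the ceiling the text attributes to mainstream DRAM technology.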
### Solution proposed in the paper:

The paper introduces the **Linked Memory Buffer (LMB)**, a scalable solution based on the CXL (Compute Express Link) memory expander. It aims to supplement the onboard DRAM of devices efficiently over the low-latency CXL protocol while minimizing the impact on performance.

- **Core idea of LMB**: Using the CXL protocol, LMB dynamically expands the memory of PCIe devices and lets CXL and PCIe devices share memory resources through efficient point-to-point access or host forwarding.
- **Specific implementation**: The LMB framework comprises the CXL memory expander, the Fabric Manager (FM), and kernel modules. It provides a unified memory allocation interface so that the NVMe and CUDA unified memory drivers can access the CXL memory expander directly and efficiently.

### Conclusion:

The LMB framework targets the shortage of onboard DRAM in PCIe devices. It uses CXL to expand device memory while keeping memory access high-bandwidth and low-latency, thereby improving overall device performance.
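To make the idea of a unified allocation interface over a shared expander more concrete, here is a minimal toy model. Every name in it (`ToyLmbPool`, `Region`, `AccessPath`, the `supports_p2p` flag) is hypothetical and chosen for illustration; it is not the paper's API, and the first-fit placement policy is an assumption rather than a documented LMB behavior.

```python
# Toy model of devices borrowing memory from a shared CXL expander.
# All names and policies here are hypothetical illustrations, not LMB's API.
from dataclasses import dataclass, field
from enum import Enum


class AccessPath(Enum):
    POINT_TO_POINT = "point-to-point"   # device reaches the expander directly
    HOST_FORWARD = "host-forward"       # host relays the device's accesses


@dataclass
class Region:
    device: str
    offset: int
    size: int
    path: AccessPath


@dataclass
class ToyLmbPool:
    capacity: int                        # bytes exposed by the expander
    regions: list = field(default_factory=list)

    def used(self):
        return sum(r.size for r in self.regions)

    def alloc(self, device, size, supports_p2p):
        """First-fit allocation of `size` bytes for `device`."""
        if size <= 0:
            raise ValueError("size must be positive")
        cursor = 0
        for r in sorted(self.regions, key=lambda r: r.offset):
            if r.offset - cursor >= size:
                break                    # the gap before r is large enough
            cursor = r.offset + r.size   # otherwise skip past r
        if cursor + size > self.capacity:
            raise MemoryError("expander pool exhausted")
        path = (AccessPath.POINT_TO_POINT if supports_p2p
                else AccessPath.HOST_FORWARD)
        region = Region(device, cursor, size, path)
        self.regions.append(region)
        return region

    def free(self, region):
        self.regions.remove(region)


if __name__ == "__main__":
    pool = ToyLmbPool(capacity=64)
    ssd = pool.alloc("ssd0", 32, supports_p2p=True)
    gpu = pool.alloc("gpu0", 16, supports_p2p=False)
    print(ssd, gpu, pool.used())
```

The point of the sketch is only that a single pool can serve different device types while recording which access path each region uses; a real implementation would also involve the Fabric Manager and kernel modules mentioned above.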