Abstract: Compute eXpress Link (CXL) is emerging as a promising memory interface technology. Because CXL devices are still not commonly available, the performance of CXL memory is largely unknown. What are the use cases for CXL memory? How does CXL memory affect application performance? How should CXL memory be used in combination with existing memory components? In this work, we study the performance of three genuine CXL memory-expansion cards from different vendors. We characterize the basic performance of CXL memory, study how HPC applications and large language models can benefit from CXL memory, and study the interplay between memory tiering and page interleaving. We also propose a novel data-object-level interleaving policy that matches the interleaving policy to memory access patterns. We reveal the challenges and opportunities of using CXL memory.
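Page interleaving spreads the pages of a memory region across local DRAM and CXL nodes so that both sets of memory channels serve traffic. The following is a minimal sketch of a weighted round-robin page interleave, loosely in the spirit of Linux's weighted-interleave memory policy; the node labels, the 3:1 weight ratio, and the function names are illustrative assumptions, not configuration taken from the paper.

```python
from collections import Counter
from itertools import cycle, islice


def weighted_interleave(num_pages: int, weights: dict[str, int]) -> list[str]:
    """Assign each page index to a memory node by weighted round-robin.

    A node with weight w receives w consecutive pages per round, so with
    {"LDRAM": 3, "CXL": 1} the pattern is LDRAM, LDRAM, LDRAM, CXL, ...
    Node names here are hypothetical labels, not real NUMA node IDs.
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and positive")
    # Expand the weights into one round of the repeating placement pattern.
    pattern = [node for node, w in weights.items() for _ in range(w)]
    # Repeat the pattern until every page has been assigned a node.
    return list(islice(cycle(pattern), num_pages))


def node_shares(placement: list[str]) -> dict[str, float]:
    """Return the fraction of pages that landed on each node."""
    counts = Counter(placement)
    return {node: n / len(placement) for node, n in counts.items()}


placement = weighted_interleave(8, {"LDRAM": 3, "CXL": 1})
print(placement)
print(node_shares(placement))
```

With a 3:1 ratio, three of every four pages stay in local DRAM. A page-level policy applies this same ratio to every data object regardless of how it is accessed, which is the mismatch that an object-level policy aims to remove.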

What problem does this paper attempt to address?

The main problem this paper addresses is evaluating how Compute eXpress Link (CXL) memory performs in practice and how it affects application performance. Specifically, the paper focuses on the following aspects:

1. **Usage scenarios of CXL memory**
   - Which application scenarios is CXL memory suitable for?
   - How can high-performance computing (HPC) applications and large language models (LLMs) benefit from CXL memory?
2. **Impact of CXL memory on application performance**
   - What is the concrete impact of real CXL memory on application performance?
   - How does CXL memory perform under different memory allocation strategies (e.g., uniform page-level interleaving vs. data-object-level interleaving)?
3. **Combining CXL memory with existing memory components**
   - How can CXL memory be combined with existing memory components, such as DDR memory, to optimize performance?
   - How do different interleaving strategies (page-level vs. data-object-level) affect performance?
4. **Performance characteristics of CXL memory**
   - What are its basic performance characteristics, such as access latency and bandwidth?
   - Where are its performance bottlenecks, for example the effect of the PCIe interconnect on CXL memory performance?
5. **Challenges and opportunities of CXL memory**
   - What challenges does CXL memory pose, such as increased memory access latency?
   - What opportunities does it bring, such as supporting larger batch sizes through increased memory capacity?

### Main Research Contents

To answer these questions, the paper conducts the following studies:

- **Performance evaluation**: Measures the basic performance of CXL memory, including access latency and bandwidth, on three real CXL memory-expansion cards from different vendors.
- **HPC application analysis**: Studies a range of HPC workloads to determine how CXL memory affects their performance, especially for compute-intensive applications.
- **Memory interleaving strategy**: Proposes a new data-object-level interleaving policy that matches placement to memory access patterns, and compares it with traditional interleaving policies.
- **Large language models**: Studies how using CXL memory as an expansion of GPU memory affects LLM training and inference.

### Key Findings

- **Performance characteristics**: The access latency of CXL memory is significantly higher than that of local DDR memory, but in some cases (for example, under high system load) its performance approaches that of remote DDR memory.
- **HPC application potential**: Some HPC applications, such as CG and BT, tolerate the lower bandwidth and higher latency of CXL memory at certain scales because they are compute-intensive.
- **Memory interleaving strategy**: The proposed object-level interleaving policy reduces fast-memory usage by 48% on average while maintaining performance similar to the LDRAM-first (local-DRAM-first) policy.
- **Large language model challenges**: Offloading tensors to CXL memory yields limited performance gains, but offloaded computation (such as optimizer and attention computations) can benefit from the additional bandwidth.

In conclusion, this paper comprehensively evaluates the real-world performance of CXL memory and explores its potential and challenges across different application scenarios.
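The contrast between an LDRAM-first policy and per-object placement can be illustrated with a toy footprint model. The sketch below assumes a hypothetical three-way rule (latency-sensitive objects stay in local DRAM, bandwidth-bound objects are page-interleaved, rarely touched objects go to CXL memory); the object names, sizes, labels, and the 50% interleave share are all invented for illustration. It counts fast-memory footprint only and says nothing about performance; the paper's actual policy and its 48% average reduction come from its own measurements, not from this model.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    name: str
    size_gb: float
    pattern: str  # hypothetical label: "latency", "bandwidth", or "cold"


def ldram_first_usage(objects: list[DataObject], ldram_capacity_gb: float) -> float:
    """Fast-memory use when every object is placed in local DRAM first."""
    total = sum(obj.size_gb for obj in objects)
    # Everything goes to LDRAM until it is full; only the overflow spills to CXL.
    return min(total, ldram_capacity_gb)


def object_level_usage(objects: list[DataObject], ldram_share: float = 0.5) -> float:
    """Fast-memory use under a toy per-object placement rule.

    - "latency": kept entirely in LDRAM.
    - "bandwidth": page-interleaved, with `ldram_share` of its pages in LDRAM.
    - "cold": placed entirely in CXL memory.
    LDRAM capacity limits are ignored for brevity.
    """
    usage = 0.0
    for obj in objects:
        if obj.pattern == "latency":
            usage += obj.size_gb
        elif obj.pattern == "bandwidth":
            usage += obj.size_gb * ldram_share
        elif obj.pattern != "cold":
            raise ValueError(f"unknown access pattern: {obj.pattern}")
    return usage


# Hypothetical workload, not taken from the paper.
objects = [
    DataObject("index", 2.0, "latency"),
    DataObject("matrix", 6.0, "bandwidth"),
    DataObject("checkpoint", 8.0, "cold"),
]
baseline = ldram_first_usage(objects, ldram_capacity_gb=32.0)
proposed = object_level_usage(objects)
print(f"LDRAM-first: {baseline} GB, object-level: {proposed} GB, "
      f"fast-memory saving {1 - proposed / baseline:.0%}")
```

The saving printed here is an artifact of the invented inputs; the point is only that per-object decisions let cold or interleavable data leave local DRAM even when it would otherwise fit there.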

Exploring and Evaluating Real-world CXL: Use Cases and System Adoption

Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices

An Overview of Computing-in-Memory Interfaces

CXL-Interference: Analysis and Characterization in Modern Computer Systems

A Comprehensive Simulation Framework for CXL Disaggregated Memory

Streamlining CXL Adoption for Hyperscale Efficiency

An Introduction to the Compute Express Link (CXL) Interconnect

CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach

Modeling and Benchmarking Computing-in-Memory for Design Space Exploration

A CXL-Powered Database System: Opportunities and Challenges

Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization

Toward CXL-Native Memory Tiering Via Device-Side Profiling

Improving key-value cache performance with heterogeneous memory tiering: A case study of CXL-based memory expansion

emucxl: an emulation framework for CXL-based disaggregated memory applications

A Physical-Aware Framework for Memory Network Design Space Exploration

A Programming Model for Disaggregated Memory over CXL

Memory Sharing with CXL: Hardware and Software Design Approaches

NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering

CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access

CXL over Ethernet: A Novel FPGA-based Memory Disaggregation Design in Data Centers

CXL and the Return of Scale-Up Database Engines