CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access

Yiwei Yang
DOI: https://doi.org/10.48550/arXiv.2309.04011
2023-09-08
Abstract: CXL is an emerging technology for expanding memory for both host CPUs and device accelerators through a load/store interface. Extending memory coherency to the PCIe root complex makes co-design more flexible, because memory can be accessed coherently using near-device compute. As demand grows for capacity at tolerable latency and bandwidth, we need a new hardware-software co-design approach that offloads synthesized memory operations to the CXL endpoint, the CXL switch, or cores near the CXL root complex, such as Intel DSA, to fetch data while the CPU or accelerator computes other work in the background. When the CXL load completes, the data is placed into L1 if capacity allows, and the in-core ROB is notified through a mailbox and resumes the computation in the previous hardware context. Because the distance (timing window) of the load instruction sequence is unknown, long-running jobs require a profiling-guided approach to code generation and to adaptively updating the offloaded code. We propose to evaluate CXLMemUring on a modified BOOMv3 with added in-core logic and CXL endpoint access simulated using CHI; we will add a weaker RISC-V core near the endpoint for code offloading, and code generation will be based on program analysis combined with traditional profiling-guided methods.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the Memory Wall in current computing systems, particularly in High-Performance Computing (HPC), Deep Learning Recommendation Model (DLRM) workloads, and large language model (LLM) training. Specifically, it focuses on how to maintain low latency and high bandwidth while expanding memory capacity so that these applications can scale effectively.

### Main problems the paper attempts to solve:

1. **Memory Expansion and Performance Optimization**:
   - Existing techniques such as the ROB (Reorder Buffer), MSHRs (Miss Status Handling Registers), prefetching caches, stack elimination, and the TLB (Translation Lookaside Buffer) can hide memory latency to some extent, but they do not extend directly to the CXL (Compute Express Link) memory pool.
   - As demand for memory capacity grows, a new hardware-software co-design method is needed to manage and access the CXL memory pool while maintaining low latency and high bandwidth.
2. **Asynchronous and Flexible Memory Access**:
   - Traditional memory access usually relies on synchronous operations, which become a significant performance bottleneck on large-scale data sets.
   - The paper proposes an asynchronous and flexible memory access method based on CXL. Synthesized memory operations are offloaded to CXL endpoints, CXL switches, or cores close to the CXL root complex (such as Intel DSA), so the CPU or accelerator can perform other work in the background.
3. **Hardware-Software Co-Design**:
   - To achieve these goals, the paper proposes a hardware-software co-design paradigm that includes a modified BOOMv3 core, CXL endpoint access simulation, and a weaker RISC-V core for code offloading.
   - The software part includes a JIT compiler that dynamically analyzes the offloading window and adaptively updates the offloaded code according to the access pattern.
   - The hardware part includes an asynchronous load engine. When a data load completes, the CPU is notified through the mailbox mechanism and resumes the previous context.
4. **Evaluation and Optimization**:
   - The paper also proposes an evaluation framework to verify the scheme in terms of capturing the instruction window size, integration with the ROB and MSHRs, chip area comparison, and guidance for the programming model.

Through these methods, the paper aims to overcome the challenges posed by the Memory Wall and improve overall system performance.
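The asynchronous load path summarized above (submit a load, continue with other work, receive a mailbox notification, resume) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: `AsyncLoadEngine`, `LoadRequest`, `run`, the tick-based clock, and the fixed `latency` value are all hypothetical names and parameters introduced here for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LoadRequest:
    tag: int       # hypothetical ID of the computation waiting on this load
    address: int   # address inside the simulated CXL memory pool


@dataclass
class AsyncLoadEngine:
    """Toy model of an offloaded load engine with a completion mailbox."""
    pool: dict                                      # simulated pool: address -> value
    latency: int = 3                                # assumed remote-load latency in ticks
    pending: deque = field(default_factory=deque)   # submitted, not yet complete
    mailbox: deque = field(default_factory=deque)   # completion notifications
    now: int = 0                                    # simulated clock

    def submit(self, req: LoadRequest) -> None:
        # Enqueue and return immediately, so the caller is never blocked.
        self.pending.append((self.now + self.latency, req))

    def tick(self) -> None:
        # Advance time and post every load whose latency has elapsed.
        self.now += 1
        while self.pending and self.pending[0][0] <= self.now:
            _, req = self.pending.popleft()
            self.mailbox.append((req.tag, self.pool[req.address]))


def run(engine: AsyncLoadEngine, addresses: list) -> tuple:
    """Issue all loads up front, overlap other work, then consume completions."""
    for tag, addr in enumerate(addresses):
        engine.submit(LoadRequest(tag, addr))

    results = {}
    overlapped_ticks = 0
    while len(results) < len(addresses):
        engine.tick()
        if engine.mailbox:
            # Mailbox notification: resume the computations that were waiting.
            while engine.mailbox:
                tag, value = engine.mailbox.popleft()
                results[tag] = value
        else:
            # No completion yet: this tick is free for independent work.
            overlapped_ticks += 1
    return results, overlapped_ticks


if __name__ == "__main__":
    engine = AsyncLoadEngine(pool={0x10: 7, 0x20: 9}, latency=3)
    print(run(engine, [0x10, 0x20]))
```

In this model, the ticks counted in `overlapped_ticks` stand in for the independent work a core could do while the loads are in flight. A real design would also need to size that window, which the paper proposes to handle through profiling-guided code generation.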