CXL Shared Memory Programming: Barely Distributed and Almost Persistent

Yi Xu, Suyash Mahar, Ziheng Liu, Mingyao Shen, Steven Swanson
2024-07-17
Abstract: While Compute Express Link (CXL) enables support for cache-coherent shared memory among multiple nodes, it also introduces new types of failures: processes can fail before data does, or data might fail before a process does. The lack of a failure model for CXL-based shared memory makes it challenging to understand and mitigate these failures. To solve these challenges, in this paper, we describe a model categorizing and handling the CXL-based shared memory's failures: data and process failures. Data failures in CXL-based shared memory render data inaccessible or inconsistent for a currently running application. We argue that such failures are unlike data failures in distributed storage systems and require CXL-specific handling. To address this, we look into traditional data failure mitigation techniques like erasure coding and replication and propose new solutions to better handle data failures in CXL-based shared memory systems. Next, we look into process failures and compare the failures and potential solutions with PMEM's failure model and programming solutions. We argue that although PMEM shares some of CXL's characteristics, it does not fully address CXL's volatile nature and low access latencies. Finally, taking inspiration from PMEM programming solutions, we propose techniques to handle these new failures. Thus, this paper is the first work to define the CXL-based shared memory failure model and propose tailored solutions that address challenges specific to CXL-based systems.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses fault handling in shared memory systems based on Compute Express Link (CXL). Specifically:

1. **Data Faults**: When data in CXL-based shared memory becomes inaccessible or inconsistent, the applications currently using it are affected. The paper argues that these faults are unlike data faults in distributed storage systems and therefore require CXL-specific handling.
2. **Process Faults**: When a process fails while updating CXL-attached shared memory, the data remains accessible but may be left in an inconsistent state.

To address these challenges, the paper proposes the following solutions:

- **Data Fault Handling**: Starting from data fault handling methods used in traditional distributed systems (such as erasure coding and replication), the paper proposes improved mechanisms tailored to CXL systems, such as replication mechanisms based on CXL switches.
- **Process Fault Handling**: Although the fault model of persistent memory (PMEM) resembles that of CXL in some respects, PMEM solutions do not fully account for CXL's volatile nature and low access latencies. The paper proposes several process fault handling methods, such as logging and checkpointing, and optimizes them for the performance characteristics of CXL systems.
- **Comprehensive Handling of Data and Process Faults**: The paper also explores methods that handle data and process faults together, to simplify implementation and improve efficiency.

In summary, the paper presents itself as the first work to define a fault model for CXL-based shared memory and proposes solutions tailored to the characteristics of CXL systems.
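
As a rough illustration of the logging idea mentioned for process faults, the sketch below simulates undo logging over a plain `bytearray` standing in for a shared region. It is not the paper's implementation: `SharedRegion`, `tx_begin`, `tx_write`, `tx_commit`, and `recover` are hypothetical names introduced here, and a real CXL system would also have to guarantee that each log entry is visible to other nodes before the corresponding data write.

```python
class SharedRegion:
    """Hypothetical stand-in for a shared-memory region.

    Both the data and the undo log are assumed to live in the shared
    region, so they outlive the process that wrote them.
    """

    def __init__(self, size):
        self.data = bytearray(size)
        self.undo_log = []       # list of (offset, old_bytes)
        self.tx_active = False   # True while an update is in flight


def tx_begin(region):
    """Start an update: reset the log and mark the region as in flight."""
    region.undo_log.clear()
    region.tx_active = True


def tx_write(region, offset, new_bytes):
    """Save the old bytes to the undo log, then overwrite them in place."""
    end = offset + len(new_bytes)
    region.undo_log.append((offset, bytes(region.data[offset:end])))
    region.data[offset:end] = new_bytes


def tx_commit(region):
    """Finish an update: the new data is now the consistent state."""
    region.tx_active = False
    region.undo_log.clear()


def recover(region):
    """Run by a surviving process after the writer has failed.

    If an update was in flight, roll back its writes in reverse order.
    Returns True if a rollback happened.
    """
    if not region.tx_active:
        return False
    for offset, old in reversed(region.undo_log):
        region.data[offset:offset + len(old)] = old
    region.undo_log.clear()
    region.tx_active = False
    return True


# Usage: one committed update, then a writer that "fails" mid-update.
region = SharedRegion(8)
tx_begin(region)
tx_write(region, 0, b"AAAA")
tx_write(region, 4, b"BBBB")
tx_commit(region)

tx_begin(region)
tx_write(region, 0, b"XXXX")   # writer fails here, before tx_commit
rolled_back = recover(region)  # another process restores the old state
```

After `recover` runs, the region again holds the last committed contents, which is the property the logging approach is meant to provide when a process fails but the data survives.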