Partial Failure Resilient Memory Management System for (CXL-based) Distributed Shared Memory

Mingxing Zhang, Teng Ma, Jinqi Hua, Zheng Liu, Kang Chen, Ning Ding, Fan Du, Jinlei Jiang, Tao Ma, Yongwei Wu
DOI: https://doi.org/10.1145/3600006.3613135
2023-01-01
Abstract: The efficiency of distributed shared memory (DSM) has been greatly improved by recent hardware technologies. However, the difficulty of distributed memory management can still be a major obstacle to the democratization of DSM, especially when partial failures of the participating clients (e.g., due to crashed processes or machines) must be tolerated. In this paper, we present CXL-SHM, an automatic distributed memory management system based on reference counting. Reference count maintenance in CXL-SHM is implemented with a special era-based non-blocking algorithm. As a result, the system avoids blocking synchronization, memory leaks, double frees, and wild pointers, even if some participating clients fail unexpectedly without freeing the memory references they hold. We evaluated our system on real CXL hardware with both micro-benchmarks and end-to-end applications; the results demonstrate the efficiency of CXL-SHM and the simplicity and flexibility of using it to build efficient distributed applications.
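To make the failure scenario in the abstract concrete, the following is a minimal single-process sketch of reference counting in which each client's held references are recorded, so that a recovery pass can release whatever a crashed client left behind. It is not the era-based non-blocking algorithm of CXL-SHM, which the abstract does not detail; the names `SharedPool`, `acquire`, `release`, and `recover` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


class SharedPool:
    """Toy model (an assumption, not CXL-SHM's design) of ref-counted shared blocks."""

    def __init__(self):
        self.refcount = {}                # block id -> number of live references
        self.held = defaultdict(Counter)  # client id -> {block id: references held}
        self.freed = []                   # block ids in the order they were freed
        self._next_id = 0

    def alloc(self, client):
        """Allocate a block; the allocating client holds its first reference."""
        block = self._next_id
        self._next_id += 1
        self.refcount[block] = 0
        self.acquire(client, block)
        return block

    def acquire(self, client, block):
        """Take one more reference to a live block on behalf of a client."""
        if block not in self.refcount:
            # Using a freed block would be a wild pointer; reject it.
            raise KeyError(f"block {block} is not live")
        self.refcount[block] += 1
        self.held[client][block] += 1

    def release(self, client, block):
        """Drop one reference; free the block when the count reaches zero."""
        if self.held[client][block] == 0:
            # Releasing a reference the client does not hold would be a double free.
            raise ValueError(f"{client} holds no reference to block {block}")
        self.held[client][block] -= 1
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.freed.append(block)

    def recover(self, client):
        """Release every reference a failed client still held, so nothing leaks."""
        for block, count in list(self.held[client].items()):
            for _ in range(count):
                self.release(client, block)
        del self.held[client]


if __name__ == "__main__":
    pool = SharedPool()
    shared = pool.alloc("A")     # client A allocates a block
    pool.acquire("B", shared)    # client B takes a second reference to it
    private = pool.alloc("B")    # client B allocates a block of its own
    pool.recover("B")            # B crashes; its references are reclaimed
    print(pool.freed, pool.refcount)
```

In this sketch, recovering client B frees only the block that B alone referenced, while the block still referenced by client A stays live until A releases it. A real DSM implementation would additionally need atomic updates and crash-consistent ownership records, which this single-process model does not attempt.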