UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space
Yilong Zhao, Mingyu Gao, Fangxin Liu, Yiwei Hu, Zongwu Wang, Han Lin, Ji Li, He Xian, Hanlin Dong, Tao Yang, Naifeng Jing, Xiaoyao Liang, Li Jiang
DOI: https://doi.org/10.1109/isca59077.2024.00053
2024-01-01
Abstract: DRAM-based Processing in Memory (PIM) addresses the "memory wall" problem by incorporating computing units (PIM units) into main memory devices for faster and wider local data access. However, critical challenges prevent PIM units from being compatible with existing CPU hosts. Memory interleaving and virtual memory limit the size of contiguous data visible to PIM units, which constrains the granularity of PIM tasks. Fine-grained PIM tasks incur significant CPU-PIM offloading overhead, offsetting the speedup of PIM. Existing PIM systems adopt drastic measures to ensure PIM task offloading efficiency, including isolating the PIM memory space and turning off global memory interleaving. These interventions, however, decrease the CPU's memory bandwidth and introduce extra data transfers, leading to an additional "system memory wall". This new "wall" must be eliminated before PIM technology can be fully embraced. In this work, we propose UM-PIM, a PIM system in which interleaved CPU pages and non-interleaved PIM pages coexist in a Uniform and Shared Memory space. UM-PIM enables zero-copy PIM task offloading and maintains the CPU's memory bandwidth while ensuring PIM offloading efficiency. First, we propose a dual-track memory management mechanism consisting of independent page allocation and address translation for each of the two kinds of pages. Second, we design UM-PIM interface hardware on the DIMM (with PIM units) side that provides dynamic address mapping to accelerate data re-layout. Finally, we provide APIs that reduce PIM-to-PIM communication overhead by optimizing the CPU's access to PIM pages in different communication modes. We compare UM-PIM with a CPU system and current PIM systems. Results show negligible performance degradation for CPU workloads (<0.1%) on UM-PIM, in contrast to the 25.8% degradation on the current PIM system with memory interleaving switched off. For PIM workloads partitioned between the CPU and PIM units, UM-PIM reduces the CPU time by 4.93x, resulting in an end-to-end speedup of 1.96x on average.
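
The abstract's claim that memory interleaving limits the contiguous data visible to a PIM unit can be illustrated with a small sketch. It is not UM-PIM's actual address mapping: the 64-byte interleaving granularity, the four memory units, the 1 GiB unit capacity, and the simple modulo mapping are all assumptions chosen for illustration (real memory controllers typically use more elaborate, often hashed, mappings). The function names are hypothetical as well.

```python
# Illustrative sketch only: all constants and mappings below are assumptions,
# not values or mechanisms taken from the UM-PIM paper.
PAGE_SIZE = 4096          # bytes in one page (assumed)
CHUNK = 64                # interleaving granularity in bytes (assumed)
NUM_UNITS = 4             # number of memory units, e.g. channels or DIMMs (assumed)
UNIT_CAPACITY = 1 << 30   # bytes per memory unit (assumed)


def interleaved_unit(addr: int) -> int:
    """Unit index when consecutive chunks rotate across units (modulo mapping)."""
    return (addr // CHUNK) % NUM_UNITS


def linear_unit(addr: int) -> int:
    """Unit index when each unit owns one large contiguous address range."""
    return (addr // UNIT_CAPACITY) % NUM_UNITS


def longest_local_run(unit_of, base: int, size: int = PAGE_SIZE) -> int:
    """Longest run of contiguous bytes in [base, base + size) held by one unit."""
    longest = 0
    current = 0
    previous = None
    for addr in range(base, base + size, CHUNK):
        unit = unit_of(addr)
        # Extend the run if this chunk stays on the same unit, else restart it.
        current = current + CHUNK if unit == previous else CHUNK
        previous = unit
        longest = max(longest, current)
    return longest


interleaved_run = longest_local_run(interleaved_unit, base=0)
linear_run = longest_local_run(linear_unit, base=0)
print(f"interleaved page: {interleaved_run} contiguous bytes per unit")
print(f"non-interleaved page: {linear_run} contiguous bytes per unit")
```

Under these assumptions, a unit sees only one 64-byte chunk at a time of an interleaved page, while a non-interleaved page sits entirely within a single unit. This is the tension the abstract describes: interleaving spreads CPU accesses across units for bandwidth, whereas PIM tasks benefit from large locally contiguous data.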