Abstract: DRAM-based Processing in Memory (PIM) addresses the "memory wall" problem by incorporating computing units (PIM units) into main memory devices for faster and wider local data access. However, critical challenges prevent PIM units from being compatible with existing CPU hosts. Memory interleaving and virtual memory limit the size of contiguous data visible to PIM units, which constrains the granularity of PIM tasks. Fine-grained PIM tasks incur significant CPU-PIM offloading overhead, offsetting the speed-up of PIM. Existing PIM systems adopt drastic measures to keep PIM task offloading efficient, including isolating the PIM memory space and turning off global memory interleaving. These interventions, however, decrease the CPU's memory bandwidth and introduce extra data transfer, leading to an additional "system memory wall". This new "wall" must be eliminated before PIM technology can be fully embraced. In this work, we propose UM-PIM, a PIM system in which interleaved CPU pages and non-interleaved PIM pages coexist in a Uniform and Shared Memory space. UM-PIM enables zero-copy PIM task offloading and maintains the CPU's memory bandwidth while ensuring PIM offloading efficiency. First, we propose a dual-track memory management mechanism with independent page allocation and address translation for each of the two kinds of pages. Second, we design UM-PIM interface hardware on the PIM-equipped DIMM side to provide dynamic address mapping that accelerates data re-layout. Finally, we provide APIs that reduce PIM-to-PIM communication overhead by optimizing the CPU's access to PIM pages in different communication modes. We compare UM-PIM with a CPU system and with current PIM systems. Results show negligible performance degradation for CPU workloads (<0.1%) on UM-PIM, in contrast with the 25.8% degradation on the current PIM system with memory interleaving switched off. For PIM workloads partitioned between the CPU and PIM units, UM-PIM reduces CPU time by 4.93x, resulting in an end-to-end speedup of 1.96x on average.
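To make the granularity problem concrete, the following minimal sketch contrasts an interleaved address mapping with a non-interleaved one. It is an illustration under stated assumptions, not the mapping used by UM-PIM or by any particular memory controller: the function names, the 4-channel configuration, the 64-byte interleaving granule, and the 1 MiB per-channel region are all hypothetical, and real controllers typically use more elaborate (for example, XOR-hashed) mappings.

```python
# Illustrative sketch only: all names and parameters below are assumptions,
# not taken from the UM-PIM design.

def interleaved_channel(addr, num_channels=4, granule=64):
    """Channel holding byte `addr` when consecutive granules rotate across channels."""
    return (addr // granule) % num_channels


def linear_channel(addr, num_channels=4, channel_size=1 << 20):
    """Channel holding byte `addr` when each channel owns one contiguous region."""
    return (addr // channel_size) % num_channels


def contiguous_bytes(channel_of, start, limit):
    """Count consecutive bytes from `start` (up to `limit`) that stay in one channel."""
    home = channel_of(start)
    n = 0
    while n < limit and channel_of(start + n) == home:
        n += 1
    return n


# Within one 4 KiB page, how much contiguous data does a single channel see?
print(contiguous_bytes(interleaved_channel, 0, 4096))  # 64
print(contiguous_bytes(linear_channel, 0, 4096))       # 4096
```

Under these assumed parameters, a unit attached to one channel sees only 64 contiguous bytes of an interleaved page but the whole page when the mapping is non-interleaved. This is the tension the abstract describes: interleaving spreads CPU traffic across channels, while PIM tasks benefit from large locally contiguous data.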

UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space
