Abstract: Deep learning (DL)-based personalized recommendation systems consume a major share of the resources in modern AI data centers. The embedding layers, with their large memory capacity requirements and high bandwidth demand, have been identified as the bottleneck of personalized recommendation inference. To mitigate the memory bandwidth bottleneck, near-memory processing (NMP) is an effective solution that exploits the through-silicon via (TSV) bandwidth within 3D-stacked DRAMs. However, existing NMP architectures are limited by the memory bandwidth of hard-to-scale TSVs. Integrating compute logic near the memory banks is a promising way to overcome this obstacle, but it is challenging: the large memory capacity requirement limits the use of 3D-stacked DRAMs, and irregular memory accesses lead to poor data locality, heavy TSV data traffic, and low bank-level bandwidth utilization. To address these problems, we propose RecPIM, the first in-memory processing system for personalized recommendation inference using a near-bank architecture based on 3D-stacked memory. From the hardware perspective, we introduce a heterogeneous memory system that combines 3D-stacked DRAM and DIMMs to accommodate large embedding tables and provide high bandwidth. By integrating processing logic units near the memory banks on DRAM dies, our architecture can exploit the enormous bank-level bandwidth, which is much higher than the TSV bandwidth. We then integrate a small scratchpad memory to exploit the unique data reusability of DL-based personalized recommendation systems. Furthermore, we adopt a unidirectional data communication scheme to avoid additional cross-vault data transfer. From the software perspective, we present a customized programming model to facilitate memory management and task offloading. To reduce data communication through TSVs and improve the utilization of bank-level bandwidth, we develop an efficient data mapping scheme that partitions each vector into smaller subvectors. Experimental results show that RecPIM achieves up to 2.58x speedup and 49.8% energy saving for data movement over the state-of-the-art NMP solution.
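
The subvector mapping idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not RecPIM's actual mapping: it assumes sum pooling over the looked-up embedding rows, an even split of each vector across banks, and hypothetical helper names (`partition_vector`, `map_table_to_banks`, `bank_local_pool`, `pooled_lookup`). The point it shows is that when every bank holds one slice of every vector, each bank can reduce its slice locally and forward only one pooled subvector instead of the raw rows.

```python
from typing import List

Vector = List[float]


def partition_vector(vec: Vector, num_banks: int) -> List[Vector]:
    """Split one embedding vector into num_banks contiguous subvectors."""
    if len(vec) % num_banks != 0:
        raise ValueError("vector length must be divisible by num_banks")
    width = len(vec) // num_banks
    return [vec[b * width:(b + 1) * width] for b in range(num_banks)]


def map_table_to_banks(table: List[Vector], num_banks: int) -> List[List[Vector]]:
    """Return banks where banks[b][row] is the b-th subvector of that row."""
    banks: List[List[Vector]] = [[] for _ in range(num_banks)]
    for vec in table:
        for b, sub in enumerate(partition_vector(vec, num_banks)):
            banks[b].append(sub)
    return banks


def bank_local_pool(bank: List[Vector], indices: List[int]) -> Vector:
    """Sum-pool the requested rows using only data stored in this bank.

    Sum pooling is an assumption of this sketch.
    """
    width = len(bank[0])
    acc = [0.0] * width
    for row in indices:
        for j in range(width):
            acc[j] += bank[row][j]
    return acc


def pooled_lookup(banks: List[List[Vector]], indices: List[int]) -> Vector:
    """Each bank reduces its own slice; the partial results are concatenated."""
    out: Vector = []
    for bank in banks:
        out.extend(bank_local_pool(bank, indices))
    return out


def reference_pool(table: List[Vector], indices: List[int]) -> Vector:
    """Baseline: sum-pool full rows without any partitioning."""
    dim = len(table[0])
    return [sum(table[i][j] for i in indices) for j in range(dim)]


if __name__ == "__main__":
    # Toy table: 4 rows of dimension 8, split across 4 banks (2 elements each).
    table = [[float(r * 10 + c) for c in range(8)] for r in range(4)]
    banks = map_table_to_banks(table, num_banks=4)
    indices = [0, 2, 3]
    assert pooled_lookup(banks, indices) == reference_pool(table, indices)
    print(pooled_lookup(banks, indices))
```

In this model every bank contributes to every lookup regardless of which rows are requested, which is one plausible reading of how partitioning improves bank-level bandwidth utilization under irregular accesses. The real hardware mapping and pooling operators may differ.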

Darwin: A DRAM-based Multi-level Processing-in-Memory Architecture for Data Analytics

PIM-DH: ReRAM-based processing-in-memory architecture for deep hashing acceleration

A design framework for processing-in-memory accelerator

Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

An Overview of Processing-in-Memory Circuits for Artificial Intelligence and Machine Learning

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

RIMAC: An Array-Level ADC/DAC-Free ReRAM-Based In-Memory DNN Processor with Analog Cache and Computation

PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System

A Survey of Resource Management for Processing-in-Memory and Near-Memory Processing Architectures

DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

RecPIM: Efficient In-Memory Processing for Personalized Recommendation Inference Using Near-Bank Architecture