Abstract: Deep learning (DL)-based personalized recommendation systems consume a major share of the resources in modern AI data centers. The embedding layers, with their large memory capacity requirement and high bandwidth demand, have been identified as the bottleneck of personalized recommendation inference. To mitigate the memory bandwidth bottleneck, near-memory processing (NMP) is an effective solution that exploits the through-silicon via (TSV) bandwidth within 3D-stacked DRAMs. However, existing NMP architectures are limited by the bandwidth of TSVs, which are hard to scale. Integrating compute logic near the memory banks is a promising way to overcome this obstacle, but it is challenging for two reasons: the large memory capacity requirement limits the use of 3D-stacked DRAMs, and irregular memory accesses lead to poor data locality, heavy TSV data traffic, and low bank-level bandwidth utilization. To address these problems, we propose RecPIM, the first in-memory processing system for personalized recommendation inference that uses a near-bank architecture based on 3D-stacked memory. From the hardware perspective, we introduce a heterogeneous memory system that combines 3D-stacked DRAM with DIMMs to accommodate large embedding tables and provide high bandwidth. By integrating processing logic units near the memory banks on the DRAM dies, our architecture exploits the bank-level bandwidth, which is much higher than the TSV bandwidth. We also integrate a small scratchpad memory to exploit the data reusability characteristic of DL-based personalized recommendation systems, and we adopt a unidirectional data communication scheme to avoid additional cross-vault data transfer. From the software perspective, we present a customized programming model to facilitate memory management and task offloading. To reduce data communication through the TSVs and improve the utilization of bank-level bandwidth, we develop an efficient data mapping scheme that partitions each embedding vector into smaller subvectors. Experimental results show that RecPIM achieves up to 2.58x speedup and 49.8% energy saving for data movement over the state-of-the-art NMP solution.
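The subvector partitioning idea can be illustrated with a minimal Python sketch. The abstract does not give the details of the mapping, so everything here is an assumption made for illustration: the function names are hypothetical, each vector is split into equal contiguous slices, slice k of every row is placed in bank k, and the embedding operation is modeled as a sum-pooling over the looked-up rows. Under those assumptions, every bank takes part in every lookup and reduces only its own slice, so only one partial result per bank has to leave the bank.

```python
def split_into_subvectors(vector, num_banks):
    """Split one embedding vector into num_banks equal contiguous subvectors."""
    if len(vector) % num_banks != 0:
        raise ValueError("vector length must be divisible by num_banks")
    width = len(vector) // num_banks
    return [vector[k * width:(k + 1) * width] for k in range(num_banks)]


def map_table_to_banks(table, num_banks):
    """Assumed mapping: banks[k][row] holds subvector k of embedding row `row`."""
    banks = [[] for _ in range(num_banks)]
    for vector in table:
        for k, sub in enumerate(split_into_subvectors(vector, num_banks)):
            banks[k].append(sub)
    return banks


def pooled_lookup(banks, indices):
    """Sum-pool the rows in `indices`; each bank reduces only its own slice."""
    partials = []
    for bank in banks:
        width = len(bank[0])
        acc = [0.0] * width
        for row in indices:
            for j in range(width):
                acc[j] += bank[row][j]
        partials.append(acc)
    # Concatenating the per-bank partial sums rebuilds the full pooled vector.
    return [x for part in partials for x in part]


# Toy table: 4 rows, embedding dimension 8, spread over 4 banks.
table = [[float(r * 10 + c) for c in range(8)] for r in range(4)]
banks = map_table_to_banks(table, num_banks=4)
pooled = pooled_lookup(banks, indices=[0, 2, 3])

# Reference result computed on the unpartitioned table.
reference = [sum(table[r][c] for r in (0, 2, 3)) for c in range(8)]
```

In this toy model the pooled result matches the reference computed on the unpartitioned table, which shows that slicing the vectors does not change the reduction, only where it is performed.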

An Efficient Near-Bank Processing Architecture for Personalized Recommendation System

RecPIM: Efficient In-Memory Processing for Personalized Recommendation Inference Using Near-Bank Architecture
