Abstract: Modern transformer-based Large Language Models (LLMs) are built from a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and the feed-forward networks perform compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators such as TPUs and NPUs handle GEMM well but are less efficient for GEMV. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation but lacks the computational power to handle GEMM effectively. Motivated by this observation, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in integrating NPU and PIM efficiently lies in enabling concurrent operation on both platforms, each handling a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, allowing only one of the NPU or the PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs equips each bank with dual row buffers, so that memory read/write operations and PIM commands can be handled simultaneously. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to pipeline two independent sub-batches within a single NeuPIMs device. Our evaluation demonstrates that, compared to GPU-only, NPU-only, and naïve NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3$\times$, 2.4$\times$, and 1.6$\times$ throughput improvements, respectively.
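The GEMM/GEMV contrast in the abstract can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is not taken from the paper: the function names, the 2-byte element size, the hidden dimension of 4096, and the batch size of 64 are all assumptions chosen for illustration, and the model assumes each operand and the result cross the memory interface exactly once. Batching raises the intensity of the weight multiplications because every request reuses the same weight matrix. Attention does not benefit in the same way, commonly because each request multiplies against its own cached keys and values, so it is modeled here as a product with a single-row operand.

```python
def gemm_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Simplifying assumption: each operand and the result are moved
    between memory and the compute units exactly once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def gemv_intensity(k, n, bytes_per_elem=2):
    """FLOPs per byte for a (1 x k) @ (k x n) product, i.e. GEMM with m = 1."""
    return gemm_intensity(1, k, n, bytes_per_elem)


if __name__ == "__main__":
    hidden = 4096  # hypothetical hidden dimension
    batch = 64     # hypothetical number of batched requests

    # Weight multiplication shared by the whole batch (GEMM).
    print(f"GEMM intensity: {gemm_intensity(batch, hidden, hidden):.2f} FLOPs/byte")
    # Per-request matrix-vector product (GEMV).
    print(f"GEMV intensity: {gemv_intensity(hidden, hidden):.2f} FLOPs/byte")
```

Under these assumptions the GEMV stays below one FLOP per byte regardless of batch size, while the batched GEMM grows roughly in proportion to the batch. This is consistent with assigning GEMM to a compute-rich NPU and GEMV to bandwidth-rich PIM.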

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A design framework for processing-in-memory accelerator

VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

RIMAC: an Array-Level ADC/DAC-Free ReRAM-Based In-Memory DNN Processor with Analog Cache and Computation

TIMELY: Pushing Data Movements and Interfaces in PIM Accelerators Towards Local and in Time Domain

DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

ReHy: A ReRAM-based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning

PIMulator-NN: an Event-Driven, Cross-level Simulation Framework for Processing-In-Memory Based Neural Network Accelerators

An Energy-Efficient Quantized and Regularized Training Framework for Processing-In-Memory Accelerators

ConvPIM: Evaluating Digital Processing-in-Memory through Convolutional Neural Network Acceleration

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems

DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing