Abstract: Modern transformer-based Large Language Models (LLMs) are constructed from a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators such as TPUs and NPUs are proficient at GEMM but less efficient at GEMV. Conversely, Processing-in-Memory (PIM) technology is tailored for efficient GEMV computation but lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating NPU and PIM lies in enabling concurrent operation on both platforms, each handling a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, in which only the NPU or the PIM can be active at any given time. Second, the inherent dependencies between GEMM and GEMV in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, allowing memory read/write operations and PIM commands to be handled simultaneously. Further, NeuPIMs employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism so that two independent sub-batches can be pipelined within a single NeuPIMs device. Our evaluation demonstrates that, compared to GPU-only, NPU-only, and naïve NPU+PIM integrated acceleration approaches, NeuPIMs achieves 3$\times$, 2.4$\times$, and 1.6$\times$ throughput improvement, respectively.
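
The GEMM/GEMV contrast in the abstract can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte of memory traffic). The following is a minimal sketch, not taken from the paper: the function name, the hidden size of 4096, the batch size of 64, and the 2-byte (FP16) element size are all illustrative assumptions, and the traffic model simply counts each operand and the output once.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """Estimate FLOPs per byte for an (m x k) @ (k x n) multiplication.

    Assumes each operand and the output cross the memory bus exactly once
    (no cache reuse modelled) and 2-byte elements by default.
    """
    flops = 2 * m * k * n                               # one multiply + one add per term
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # inputs read + output written
    return flops / traffic


# Hypothetical sizes: hidden dimension 4096, batch of 64 requests.
# Batched QKV generation / feed-forward: all requests share one weight matrix (GEMM).
gemm = arithmetic_intensity(m=64, k=4096, n=4096)

# Attention-style product: one vector against a per-request matrix (GEMV).
gemv = arithmetic_intensity(m=1, k=4096, n=4096)

print(f"GEMM: {gemm:.1f} FLOPs/byte")
print(f"GEMV: {gemv:.2f} FLOPs/byte")
```

Under these assumptions the GEMM case performs tens of FLOPs per byte moved, while the GEMV case performs about one, which is consistent with the abstract's characterization of GEMM as compute-intensive and GEMV as bandwidth-heavy.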

WiP: Efficient LLM Prefilling with Mobile NPU

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

LLMCad: Fast and Scalable On-device Large Language Model Inference

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Distributed Inference Performance Optimization for LLMs on CPUs

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Efficient Deployment of Large Language Model Across Cloud-Device Systems

LLM as a System Service on Mobile Devices

Progressive Mixed-Precision Decoding for Efficient LLM Inference

Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management

ELMS: Elasticized Large Language Models On Mobile Devices

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

Efficient and Economic Large Language Model Inference with Attention Offloading

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

LiveMind: Low-latency Large Language Models with Simultaneous Inference

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

PowerInfer-2: Fast Large Language Model Inference on a Smartphone