Abstract: Deploying advanced large language models on edge devices, such as smartphones and robots, is a growing trend that enhances user data privacy and network connectivity resilience while preserving intelligent capabilities. However, such a task exhibits single-batch computing with extremely low arithmetic intensity, which poses significant challenges of huge memory footprint and bandwidth demands on limited edge resources. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture with an NPU and a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. This hybrid architecture utilizes both the high computing capability of the NPU and the data capacity of the NAND flash chip, with a proposed hardware-tiling strategy that minimizes the data movement overhead between the NPU and the NAND flash chip. Specifically, the NAND flash chip, enhanced by our in-flash computing and on-die ECC techniques, excels at performing precise lightweight on-die processing. Simultaneously, the NPU collaborates with the flash chip for matrix operations and handles special function computations beyond the flash's on-die processing capabilities. Overall, Cambricon-LLM enables the on-device inference of 70B LLMs at a speed of 3.44 tokens/s, and 7B LLMs at a speed of 36.34 tokens/s, which is over 22× to 45× faster than existing flash-offloading technologies, showing the potential of deploying powerful LLMs on edge devices.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the problems of huge memory footprint and bandwidth requirements that arise when deploying large language models (LLMs) on edge devices such as smartphones and robots. Specifically, the paper addresses the following two main challenges:

1. **Huge memory footprint**:
   - A large language model (e.g., Llama-70B) requires approximately 70 GB of memory after INT8 quantization, which far exceeds the DRAM capacity of a typical smartphone.
   - The large number of parameters leads to frequent data movement, which is the main source of energy consumption during single-batch inference on edge devices.
2. **Extremely low arithmetic intensity and high bandwidth requirements**:
   - The arithmetic intensity of single-batch inference is very low (only 2), which means that the workload is severely limited by memory bandwidth.
   - Compared with traditional AI algorithms (such as DLRM, BERT, and VGG), the arithmetic intensity of single-batch LLM inference is 30 to 100 times lower, far below what hardware such as Apple A16, NVIDIA A100, and NVIDIA Jetson Orin needs to keep its compute units busy.

To solve these problems, the paper proposes Cambricon-LLM, a chiplet-based hybrid architecture that combines a neural processing unit (NPU) with a dedicated NAND flash chip. This architecture enables efficient on-device inference in the following ways:

- **Utilizing the high computing power of the NPU and the large data capacity of NAND flash**: A hardware-tiling strategy minimizes the data movement overhead between the NPU and the NAND flash chip.
- **In-flash computing and on-die ECC**: The NAND flash chip is extended with on-die computing capability and an ultra-lightweight on-die error-correction (ECC) unit, which preserves inference accuracy.

Through these designs, Cambricon-LLM can run 70B LLM inference on edge devices at 3.44 tokens/s and 7B LLM inference at 36.34 tokens/s, which is over 22 to 45 times faster than existing flash-offloading techniques, demonstrating the potential for deploying powerful LLMs on edge devices.
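The figures quoted in this summary (about 70 GB for an INT8 70B model, an arithmetic intensity of 2) can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only and does not come from the paper: the function names are hypothetical, the 8192 x 8192 layer size is an arbitrary example, activation traffic is ignored, and the bandwidth estimate assumes every weight is read exactly once per generated token.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def gemv_arithmetic_intensity(rows: int, cols: int, bytes_per_weight: float) -> float:
    """Operations per weight byte for y = W @ x at batch size 1.

    Each weight is used exactly once per token: one multiply and one add.
    Activation traffic is ignored (assumption), since it is tiny next to W.
    """
    ops = 2 * rows * cols
    weight_bytes = rows * cols * bytes_per_weight
    return ops / weight_bytes


def required_bandwidth_gbs(footprint_gb: float, tokens_per_s: float) -> float:
    """Weight-read bandwidth implied by a token rate.

    Assumes every weight is read once per generated token (dense model,
    no reuse across tokens); this is an illustrative assumption.
    """
    return footprint_gb * tokens_per_s


if __name__ == "__main__":
    footprint = weight_footprint_gb(70e9, 1)          # 70B parameters, INT8
    intensity = gemv_arithmetic_intensity(8192, 8192, 1)
    bandwidth = required_bandwidth_gbs(footprint, 3.44)
    print(f"footprint: {footprint:.1f} GB")
    print(f"arithmetic intensity: {intensity:.1f} ops/byte")
    print(f"implied weight bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions, sustaining 3.44 tokens/s on a 70 GB model implies reading weights at roughly 240 GB/s, which suggests why the design performs part of the computation inside the flash chip rather than moving every weight to the NPU.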

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
