Abstract: This paper introduces PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLMs on two smartphones, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at http://www.powerinfer.ai/v2.

What problem does this paper attempt to address?

The paper aims to enable high-speed inference of large language models (LLMs) on smartphones, especially when the model size exceeds the device's memory capacity. The main challenges are:

1. **Hardware resource limitations**: Smartphones have limited computing power, memory, and I/O bandwidth, which fall short of the compute and storage demands of LLM inference.
2. **Insufficient utilization of heterogeneous hardware**: Smartphones are usually equipped with multiple heterogeneous processors (such as CPU, GPU, and NPU), but existing inference frameworks fail to fully utilize them.
3. **I/O bottleneck**: When the model weights are too large to fit in memory, some weights must be kept in external storage, and frequent I/O operations become a performance bottleneck.

To address these problems, PowerInfer-2 proposes the following innovations:

- **Fine-grained neuron-cluster computing**: Decompose traditional matrix computations into fine-grained neuron-cluster computations that better match the heterogeneous hardware of smartphones.
- **Polymorphic neuron engine**: Dynamically adjust the computation strategy to the stage of inference (prefilling or decoding) so that each stage exploits the strengths of different hardware.
- **Segmented neuron cache**: Use a dedicated caching strategy to reduce I/O overhead and improve the cache hit rate.
- **Fine-grained neuron-cluster pipeline**: Overlap I/O operations with neuron-cluster computations to hide I/O latency.

Together, these methods let PowerInfer-2 run LLM inference efficiently on smartphones, significantly improving inference speed and reducing memory usage.
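The caching and pipelining ideas can be illustrated with a minimal Python sketch. It is not PowerInfer-2's implementation: `NeuronCache` is a plain LRU cache rather than the paper's segmented design, `load_cluster` is a hypothetical stand-in for reading weights from flash storage, and the single I/O thread only shows how fetching cluster i+1 can overlap with computing cluster i.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class NeuronCache:
    """Tiny LRU cache keyed by cluster id (hypothetical, not the paper's segmented cache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, cluster_id, load_fn):
        if cluster_id in self.entries:
            self.entries.move_to_end(cluster_id)  # mark as most recently used
            self.hits += 1
            return self.entries[cluster_id]
        self.misses += 1
        weights = load_fn(cluster_id)  # cache miss: pay the I/O cost
        self.entries[cluster_id] = weights
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used cluster
        return weights


def load_cluster(cluster_id):
    """Assumed stand-in for reading one cluster's weights from storage."""
    return [float(cluster_id + 1)] * 4


def compute_cluster(weights, x):
    """Dot product of one cluster's weight row with the input vector."""
    return sum(w * xi for w, xi in zip(weights, x))


def pipelined_inference(cluster_ids, x, cache):
    """Compute cluster i while cluster i+1 is being fetched on the I/O thread."""
    results = []
    if not cluster_ids:
        return results
    # One worker keeps all cache accesses on a single thread.
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(cache.get, cluster_ids[0], load_cluster)
        for next_id in list(cluster_ids[1:]) + [None]:
            weights = pending.result()  # wait for the current cluster's weights
            if next_id is not None:
                # Start the next fetch before computing on the current cluster.
                pending = io_pool.submit(cache.get, next_id, load_cluster)
            results.append(compute_cluster(weights, x))
    return results
```

Calling `pipelined_inference([0, 1, 0, 2], [1.0] * 4, NeuronCache(2))` processes four clusters, and the repeated cluster `0` is served from the cache instead of being reloaded. A real system would size clusters and pick an eviction policy based on measured activation frequency and storage characteristics, which this sketch does not model.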
### Formula examples

Some formulas involved in the discussion can be written in Markdown as follows:

- Matrix multiplication: $C = A \times B$
- Activation function (e.g., ReLU): $f(x) = \max(0, x)$
- Read bandwidth: $B_{\text{read}} = \frac{D}{T}$, where $D$ is the amount of data read and $T$ is the read time.

Through these techniques, PowerInfer-2 achieves efficient LLM inference on smartphones and addresses the performance bottlenecks that existing methods face on mobile devices.
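As a worked illustration of these formulas, the sketch below evaluates a small feed-forward layer in pure Python. It assumes that the neurons worth skipping are those whose ReLU output is zero; the function names and the row-per-neuron layout of `W_down_rows` are choices made for this example, not details taken from the paper.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def matvec(A, b):
    """Matrix-vector product, the single-column case of C = A x B."""
    return [sum(a * bj for a, bj in zip(row, b)) for row in A]


def sparse_ffn(W_up, W_down_rows, x):
    """Compute W_down @ relu(W_up @ x), touching only active neurons.

    W_down_rows[i] holds the output weights of neuron i (assumed layout).
    Returns the output vector and the indices of neurons that fired.
    """
    h = [relu(v) for v in matvec(W_up, x)]
    active = [i for i, v in enumerate(h) if v > 0.0]
    out_dim = len(W_down_rows[0])
    y = [0.0] * out_dim
    for i in active:  # neurons with zero activation contribute nothing
        for j in range(out_dim):
            y[j] += h[i] * W_down_rows[i][j]
    return y, active


def read_bandwidth(bytes_read, seconds):
    """B_read = D / T, in bytes per second."""
    return bytes_read / seconds
```

With `W_up = [[1, 0], [-1, 0], [0, 1]]` and `x = [2, 3]`, the second neuron's pre-activation is negative, so ReLU zeroes it and its output weights are never read. Only the weights of active neurons need to be in memory, which is why skipping inactive neurons also reduces the I/O volume $D$.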

PowerInfer-2: Fast Large Language Model Inference on a Smartphone

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

LLMCad: Fast and Scalable On-device Large Language Model Inference

Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management

Inference Performance Optimization for Large Language Models on CPUs

Large Language Models on Mobile Devices: Measurements, Analysis, and Insights

Imp: Highly Capable Large Multimodal Models for Mobile Devices

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Fast On-device LLM Inference with NPUs

Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation

Distributed Inference Performance Optimization for LLMs on CPUs

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

A Performance Evaluation of a Quantized Large Language Model on Various Smartphones

SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on Fine-Grained Group Sparsity

Inference Acceleration for Large Language Models on CPUs

Fast Distributed Inference Serving for Large Language Models

CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation