Abstract: The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
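
The three patterns named in the abstract can be pictured as boolean masks over the causal attention matrix. The following is a minimal NumPy sketch, not the released MInference code: the abstract does not define the patterns, so the interpretations here are assumptions (A-shape as a few initial key columns plus a local window, Vertical-Slash as selected key columns plus selected diagonals, Block-Sparse as a set of kept tiles). All function and parameter names (`a_shape_mask`, `n_init`, `n_local`, `vertical_cols`, `slash_offsets`, `kept_blocks`) are hypothetical, and the masks are materialized densely only for illustration.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: query i may attend to keys 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def a_shape_mask(n, n_init, n_local):
    # Assumed A-shape: the first n_init key columns plus a local window
    # of the n_local most recent keys for each query.
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    keep = (k < n_init) | (q - k < n_local)
    return keep & causal_mask(n)

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    # Assumed Vertical-Slash: selected key columns (vertical lines) plus
    # selected diagonals, where a diagonal is a fixed offset q - k.
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    vertical = np.isin(k, vertical_cols)
    slash = np.isin(q - k, slash_offsets)
    return (vertical | slash) & causal_mask(n)

def block_sparse_mask(n, block, kept_blocks):
    # Assumed Block-Sparse: keep only the listed (query_block, key_block) tiles.
    mask = np.zeros((n, n), dtype=bool)
    for bi, bj in kept_blocks:
        mask[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block] = True
    return mask & causal_mask(n)

def density(mask):
    # Fraction of causal entries that the sparse pattern still computes.
    return mask.sum() / causal_mask(mask.shape[0]).sum()
```

For example, `density(a_shape_mask(4096, 64, 512))` gives the share of causal attention entries that such a head would still compute. A practical kernel would work from compact indices rather than a dense mask, which is the role the abstract assigns to the dynamically built sparse indices.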

Tabi: an Efficient Multi-Level Inference System for Large Language Models

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

LiveMind: Low-latency Large Language Models with Simultaneous Inference

A Survey on Efficient Inference for Large Language Models

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design

Minions: Accelerating Large Language Model Inference with Aggregated Speculative Execution

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Inference Performance Optimization for Large Language Models on CPUs

QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

LLMCad: Fast and Scalable On-device Large Language Model Inference

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

Fast Distributed Inference Serving for Large Language Models

Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models