MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

2024-10-30

Abstract: The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the computational cost of inference in long-context large language models (LLMs). As input prompts grow longer, the computation required in the pre-filling stage (processing the input prompt before the first output token is generated) grows quadratically with prompt length because of the attention computation, and latency becomes unacceptable. For example, an 8-billion-parameter LLM takes 30 minutes to process a 1-million-token prompt on a single A100 GPU, which greatly hinders the wide application of long-context LLMs. Existing methods for accelerating pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To solve this problem, the paper proposes MInference (Milliontokens Inference), a technique that accelerates pre-filling for long sequences through dynamic sparse attention. MInference identifies three patterns in long-context attention matrices (A-shape, Vertical-Slash, and Block-Sparse), assigns the best pattern to each attention head offline, and builds sparse indices dynamically at inference time according to the assigned pattern, so that attention can be computed sparsely with optimized GPU kernels. The method can be applied directly to existing LLMs without modifying the pre-training setup or performing additional fine-tuning, and it reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
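To make the three pattern names concrete, here is a minimal NumPy sketch of what each mask could look like on a small causal attention matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names, parameters (`n_sink`, `window`, `vertical_cols`, `slash_offsets`, `kept_blocks`), and the reading of each pattern (A-shape as initial tokens plus a local window, Vertical-Slash as selected key columns plus selected diagonals, Block-Sparse as selected tiles) are assumptions made for this example. The masks are applied to a dense score matrix, so this sketch shows which entries are kept but saves no computation; the speedup reported in the paper comes from GPU kernels that skip the masked-out work.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: query i may attend to keys 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def a_shape_mask(n, n_sink, window):
    """Assumed A-shape: the first n_sink keys plus a local sliding window."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    keep = (k < n_sink) | (q - k < window)
    return keep & causal_mask(n)

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Assumed Vertical-Slash: chosen key columns plus chosen diagonals,
    where a diagonal is identified by its offset q - k."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    keep = np.isin(k, vertical_cols) | np.isin(q - k, slash_offsets)
    return keep & causal_mask(n)

def block_sparse_mask(n, block, kept_blocks):
    """Assumed Block-Sparse: chosen (query_block, key_block) tiles."""
    keep = np.zeros((n, n), dtype=bool)
    for qb, kb in kept_blocks:
        keep[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    return keep & causal_mask(n)

def masked_attention(q, k, v, mask):
    """Dense reference attention with masked-out scores set to -inf.
    Every query row must keep at least one key (e.g. its own position)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

if __name__ == "__main__":
    n = 8
    masks = {
        "A-shape": a_shape_mask(n, n_sink=2, window=3),
        "Vertical-Slash": vertical_slash_mask(n, [0, 3], [0, 2]),
        "Block-Sparse": block_sparse_mask(n, 4, [(0, 0), (1, 1)]),
    }
    for name, m in masks.items():
        print(name, "keeps", int(m.sum()), "of", int(causal_mask(n).sum()), "causal entries")
        print(m.astype(int))
```

Each helper intersects its pattern with the causal mask, and the examples always keep the main diagonal so that no softmax row is empty. In a per-head setup like the one the paper describes, the pattern type would be fixed offline and only the indices (which columns, diagonals, or blocks) would be estimated at inference time.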


Squeezed Attention: Accelerating Long Context Length LLM Inference

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs

Fast On-device LLM Inference with NPUs

Efficient LLM inference solution on Intel GPU

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

Self-Selected Attention Span for Accelerating Large Language Model Inference

Context Parallelism for Scalable Million-Token Inference

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

LiveMind: Low-latency Large Language Models with Simultaneous Inference

CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

SparQ Attention: Bandwidth-Efficient LLM Inference