Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal

2024-08-10

Abstract: This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

Computation and Language, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing

What problem does this paper attempt to address?

This paper addresses the memory and compute limits that Transformers, and the large language models (LLMs) built on them, run into when processing extremely long input sequences. Because of how the attention mechanism works, the standard Transformer architecture uses excessive memory and takes too long to compute once the input exceeds a certain length. For example, for a model with 500 billion parameters, a batch size of 512, and a context length of 2,048, the attention Key-Value (KV) states require 3 terabytes of memory (Pope et al., 2023). Scaling LLMs to longer sequences (such as 1 million tokens) is therefore extremely challenging and costly under the standard Transformer architecture.

To solve this problem, the paper proposes a new method, Infini-attention, which enables Transformer-based LLMs to process infinitely long inputs with bounded memory and computation. Infini-attention achieves this by incorporating a compressive memory into the standard attention mechanism. It combines masked local attention with long-term linear attention, so that a single Transformer block handles both short-range and long-range context dependencies. The method supports continual pre-training and long-context adaptation, and it improves performance on long-context tasks while keeping memory usage low. The tasks evaluated are long-context language modeling, passkey retrieval at a sequence length of 1 million tokens, and book summarization with inputs of 500,000 tokens.
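The combination of local attention and a compressive memory described above can be illustrated with a minimal single-head NumPy sketch. The summary does not give the equations, so several details here are assumptions made for illustration: the ELU + 1 feature map for the linear-attention memory, the additive memory update, the scalar gate that mixes the two outputs, and the omission of learned query/key/value projections. All function and parameter names (`infini_attention_segment`, `process_stream`, `gate_logit`) are hypothetical.

```python
import numpy as np

def elu_plus_one(x):
    # Assumed positive feature map for the linear-attention memory: ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_softmax_attention(q, k, v):
    # Masked local attention: each position attends only to itself and
    # earlier positions inside the current segment.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def infini_attention_segment(q, k, v, memory, norm, gate_logit=0.0, eps=1e-6):
    # q, k: (n, d_k); v: (n, d_v); memory: (d_k, d_v); norm: (d_k,)
    sq, sk = elu_plus_one(q), elu_plus_one(k)

    # Long-term read: linear-attention retrieval from the compressive memory
    # built from all previous segments.
    from_memory = (sq @ memory) / ((sq @ norm)[:, None] + eps)

    # Short-term read: masked softmax attention within this segment.
    from_local = causal_softmax_attention(q, k, v)

    # Assumed scalar gate in (0, 1) that mixes the two reads.
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    out = gate * from_memory + (1.0 - gate) * from_local

    # Fold this segment's keys and values into the fixed-size memory.
    new_memory = memory + sk.T @ v
    new_norm = norm + sk.sum(axis=0)
    return out, new_memory, new_norm

def process_stream(segments, gate_logit=0.0):
    # Each segment is an (n, d) array used directly as q, k, and v
    # (learned projections are left out to keep the sketch short).
    d = segments[0].shape[1]
    memory, norm = np.zeros((d, d)), np.zeros(d)
    outputs = []
    for seg in segments:
        out, memory, norm = infini_attention_segment(
            seg, seg, seg, memory, norm, gate_logit
        )
        outputs.append(out)
    return outputs, memory, norm
```

In this sketch the state carried between segments is one `(d_k, d_v)` matrix plus one length-`d_k` vector, so its size does not depend on how many segments have been processed. That is the sense in which memory stays bounded as the input grows.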

ReAttention: Training-Free Infinite Context with Finite Attention Scope

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Empower Your Model with Longer and Better Context Comprehension

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

In-Context Former: Lightning-fast Compressing Context for Large Language Model

InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

Squeezed Attention: Accelerating Long Context Length LLM Inference

Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern

InAttention: Linear Context Scaling for Transformers

Recycled Attention: Efficient inference for long-context language models

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling

UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

LongHeads: Multi-Head Attention is Secretly a Long Context Processor

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

Long-Context Language Modeling with Parallel Context Encoding