Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal
2024-08-10
Abstract: This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
Computation and Language, Artificial Intelligence, Machine Learning, Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses the memory and compute limits that Transformers, and the large language models (LLMs) built on them, run into when processing extremely long input sequences. Because the cost of standard attention grows with context length, memory use and computation time become prohibitive once inputs exceed a certain length. For example, for a model with 500 billion parameters, a batch size of 512, and a context length of 2,048, the attention key-value (KV) states require 3 terabytes of memory (Pope et al., 2023). Scaling LLMs to much longer sequences, such as 1 million tokens, is therefore extremely challenging and costly under the standard Transformer architecture.

To solve this problem, the paper proposes Infini-attention, a method that lets Transformer-based LLMs process infinitely long inputs with bounded memory and computation. Infini-attention does this by adding a compressive memory to the standard attention mechanism. It combines masked local attention with long-term linear attention, so that a single Transformer block handles both short-range and long-range context dependencies. The method supports continual pre-training and long-context adaptation, and it improves performance on long-context tasks while keeping memory usage low. The evaluated tasks are long-context language modeling, passkey retrieval at a sequence length of 1 million, and book summarization at a length of 500K.
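
The idea of combining local masked attention with a fixed-size compressive memory can be illustrated with a minimal single-head sketch in plain Python. This summary does not give the update equations, so several details here are assumptions made for illustration: the memory is a d_k x d_v matrix updated in the style of linear attention, the feature map is ELU + 1, and the two attention outputs are mixed with a fixed scalar gate (in a real model the gate would be learned). All function names are hypothetical.

```python
import math

def elu_plus_one(x):
    # Positive feature map often used in linear attention (assumed here).
    return x + 1.0 if x > 0 else math.exp(x)

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def local_causal_attention(Q, K, V):
    # Masked dot-product attention restricted to the current segment.
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1))
                    for c in range(len(V[0]))])
    return out

def memory_read(Q, M, z, eps=1e-6):
    # Retrieve from the compressive memory: sigma(q) M / (sigma(q) . z).
    out = []
    for q in Q:
        sq = [elu_plus_one(x) for x in q]
        denom = sum(a * b for a, b in zip(sq, z)) + eps
        out.append([sum(sq[r] * M[r][c] for r in range(len(sq))) / denom
                    for c in range(len(M[0]))])
    return out

def memory_write(K, V, M, z):
    # Fold the segment into the memory in place: M += sigma(K)^T V.
    for k, v in zip(K, V):
        sk = [elu_plus_one(x) for x in k]
        for r, s in enumerate(sk):
            z[r] += s
            for c, val in enumerate(v):
                M[r][c] += s * val

def infini_attention_stream(segments, d_k, d_v, gate=0.0):
    # segments: iterable of (Q, K, V), each a list of per-token vectors.
    M = [[0.0] * d_v for _ in range(d_k)]   # fixed-size memory matrix
    z = [0.0] * d_k                          # normalization term
    g = 1.0 / (1.0 + math.exp(-gate))        # sigmoid of the scalar gate
    outputs = []
    for Q, K, V in segments:
        a_mem = memory_read(Q, M, z)             # long-range context
        a_loc = local_causal_attention(Q, K, V)  # short-range context
        for row_mem, row_loc in zip(a_mem, a_loc):
            outputs.append([g * m + (1.0 - g) * l
                            for m, l in zip(row_mem, row_loc)])
        memory_write(K, V, M, z)  # update only after reading
    return outputs, M, z
```

In this sketch the state carried between segments is just M and z, whose sizes depend only on d_k and d_v. Streaming more segments grows the output but not the memory, which is the property that keeps memory and computation bounded for arbitrarily long inputs.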