Abstract: Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our code is released at https://github.com/Glaciohound/LM-Infinite.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models" addresses the performance degradation of current large language models (LLMs) on extremely long texts. Specifically:

1. **Training Data Limitations**: Current LLMs are typically trained on short text segments (e.g., fewer than 4K tokens) because the computational complexity of the Transformer architecture is quadratic in the input length. As a result, performance declines sharply when the model processes inputs longer than those seen during training.
2. **Practical Application Limitations**: This degradation severely limits the use of LLMs in tasks that require long contexts, such as encoding scientific articles, code repositories, or long dialogues.
3. **Limitations of Existing Techniques**: Commonly used techniques such as truncating the attention window or relative positional encodings are inadequate to address the length generalization problem.

### Solution

To address these issues, the authors propose LM-Infinite, a simple yet effective method that enhances the ability of LLMs to handle long contexts without any parameter updates. The main contributions include:

1. **Theoretical Analysis**: Through theoretical analysis and empirical studies, the authors identify three main factors behind length generalization failure:
   - Unseen distances between tokens
   - Unseen numbers of attended tokens
   - Implicit positional information in the initial tokens
2. **Method Design**: LM-Infinite consists of two main components that mitigate the above factors:
   - **Λ-shaped Attention Mask**: Forces the model to attend only to the beginning of the sequence and to the most recent tokens, ignoring the rest.
   - **Distance Cap**: Limits relative distance values to the maximum seen during model training.
3. **Experimental Validation**: Experimental results show that LM-Infinite significantly improves the performance of LLMs on extremely long inputs and also improves zero-shot performance on downstream tasks such as Passkey Retrieval and Qasper. In addition, LM-Infinite brings efficiency improvements: a 2.7x decoding speedup and a 7.5x memory saving over the original model.

### Experimental Results

- **Language Modeling**: Experiments on the ArXiv and OpenWebText2 datasets show that LM-Infinite lets various LLMs generalize to inputs of up to 200M tokens while maintaining perplexity and generation quality.
- **Downstream Tasks**: Experiments on Passkey Retrieval and Qasper show that, for Llama-2, LM-Infinite significantly outperforms both the original model and truncation baselines in the zero-shot setting.

In summary, LM-Infinite offers an effective way to counter the performance degradation of LLMs on extremely long texts without any parameter updates, and it has broad application prospects.
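The two components named in the summary, the Λ-shaped attention mask and the distance cap, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names and the parameters `n_global`, `n_local`, and `max_distance` are hypothetical, chosen only for illustration, and how the capped distance is fed into a particular positional encoding is left out.

```python
from typing import List


def lambda_mask(seq_len: int, n_global: int, n_local: int) -> List[List[bool]]:
    """Build a causal Λ-shaped mask; mask[i][j] is True if query i may attend to key j.

    n_global: number of starting tokens every query may attend to (hypothetical name).
    n_local:  size of the window of most recent tokens (hypothetical name).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                 # never attend to future tokens
            is_global = j < n_global        # the beginning of the sequence
            is_local = (i - j) < n_local    # the most recent tokens
            row.append(causal and (is_global or is_local))
        mask.append(row)
    return mask


def capped_distance(i: int, j: int, max_distance: int) -> int:
    """Relative distance between query i and key j, clipped to max_distance.

    max_distance stands in for the largest distance seen during training
    (an assumed parameter for this sketch).
    """
    return min(i - j, max_distance)


if __name__ == "__main__":
    mask = lambda_mask(seq_len=8, n_global=2, n_local=3)
    last_row = [j for j, allowed in enumerate(mask[7]) if allowed]
    print(last_row)                    # keys visible to the last query
    print(capped_distance(7, 0, 4))    # far-away key: distance is clipped
    print(capped_distance(7, 5, 4))    # nearby key: distance is unchanged
```

In this sketch each query attends to at most `n_global + n_local` keys regardless of sequence length, which is consistent with the decoding speed and memory savings the summary reports, although the actual figures depend on the authors' implementation.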

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Language Models can Self-Lengthen to Generate Long Texts

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

CLEX: Continuous Length Extrapolation for Large Language Models

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Extending Context Window of Large Language Models via Semantic Compression

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

Training-Free Long-Context Scaling of Large Language Models

Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly

Large Language Models are Strong Zero-Shot Retriever

Why Does the Effective Context Length of LLMs Fall Short?

LongVLM: Efficient Long Video Understanding via Large Language Models

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models