LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang
DOI: https://doi.org/10.48550/arXiv.2308.16137
2024-06-25
Abstract: Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our codes are released at https://github.com/Glaciohound/LM-Infinite.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper "LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models" addresses the performance degradation of current large language models (LLMs) on inputs far longer than those seen during training. Specifically:

1. **Training Data Limitations**: LLMs are typically trained on short text segments (e.g., fewer than 4K tokens) because the computational cost of the Transformer architecture grows quadratically with input length. As a result, performance declines sharply once inputs exceed the training length.
2. **Practical Application Limitations**: This degradation severely limits the use of LLMs in tasks that require long contexts, such as encoding scientific articles, code repositories, or long dialogues.
3. **Limitations of Existing Techniques**: Commonly used techniques such as truncating the attention window or relative positional encodings are, according to the paper's theoretical analysis, inadequate to solve the length generalization problem.

### Solution

To address these issues, the authors propose LM-Infinite, a simple and effective method that improves an LLM's handling of long contexts without any parameter updates. The paper's main contributions are:

1. **Theoretical Analysis**: Through theoretical analysis and empirical study, the paper identifies three main factors behind length generalization failure:
   - Unseen distances between tokens
   - Unseen numbers of tokens under attention
   - Implicitly encoded positional information in the initial tokens
2. **Method Design**: LM-Infinite consists of two components that mitigate these factors:
   - **Λ-shaped Attention Mask**: Restricts each token to attend only to the beginning of the sequence and to the most recent tokens, ignoring everything in between.
   - **Distance Cap**: Limits relative distance values to the maximum seen during model training.
3. **Experimental Validation**: LM-Infinite lets LLMs retain perplexity on extremely long inputs and improves zero-shot performance on downstream tasks such as Passkey Retrieval and Qasper. It also brings efficiency gains: a 2.7x decoding speedup and a 7.5x memory saving over the original model.

### Experimental Results

- **Language Modeling**: Experiments on the ArXiv and OpenWebText2 datasets show that LM-Infinite lets various LLMs generalize to inputs of up to 200M tokens while maintaining perplexity and generation quality.
- **Downstream Tasks**: On Passkey Retrieval and Qasper, LM-Infinite outperforms both the original model and truncation baselines in the zero-shot setting for Llama-2.

In summary, LM-Infinite mitigates the performance degradation of large language models on extremely long inputs without requiring any parameter updates.
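
The two components listed under Method Design can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names and the parameters `n_global`, `n_local`, and `max_distance` are hypothetical, chosen only to show what a Λ-shaped causal mask and a capped relative distance might look like under the description above.

```python
def lambda_mask(seq_len, n_global, n_local):
    """Build a seq_len x seq_len boolean mask (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j.
    A key is visible if it is causal (j <= i) and is either among the first
    n_global tokens or among the n_local most recent tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            is_global = j < n_global        # left arm of the Lambda shape
            is_local = (i - j) < n_local    # right arm: recent tokens
            row.append(causal and (is_global or is_local))
        mask.append(row)
    return mask


def capped_distance(i, j, max_distance):
    """Relative distance between query i and key j, capped at max_distance.

    max_distance stands in for the largest distance seen during training
    (an assumed parameter, not a value taken from the paper).
    """
    return min(i - j, max_distance)


def render(mask):
    """Render the mask as text, 'x' for visible keys and '.' for masked ones."""
    return ["".join("x" if v else "." for v in row) for row in mask]


if __name__ == "__main__":
    for line in render(lambda_mask(seq_len=8, n_global=2, n_local=3)):
        print(line)
    print(capped_distance(10, 2, max_distance=4))
```

In this sketch each query attends to at most `n_global + n_local` keys regardless of sequence length, which is the property that keeps both the number of attended tokens and the effective distances within the range seen during training.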