YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole
2023-11-02
Abstract: Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online up to 128k context length at https://github.com/jquesnelle/yarn
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to extend the context window of large language models that use Rotary Position Embeddings (RoPE), so that they can handle texts longer than those seen during pre-training. Existing models of this kind (such as LLaMA, GPT-NeoX, and PaLM) struggle with long sequences because their usable context length is bounded by the sequence length used in pre-training. The paper proposes YaRN (Yet another RoPE extensioN method), which aims to extend the context window of these models efficiently, with minimal fine-tuning or even without fine-tuning.

The main contributions of YaRN are:

1. **Efficient context window extension**: YaRN requires 10x fewer tokens and 2.5x fewer training steps than previous methods while achieving better results.
2. **Surpassing existing methods**: Experiments show that LLaMA models whose context window is extended with YaRN handle long sequences better than models extended with other existing methods.
3. **Generalization ability**: YaRN performs well within the limited context of the fine-tuning dataset and also extrapolates to longer contexts not seen during fine-tuning.

In summary, YaRN addresses the difficulty that existing RoPE-based language models have with long sequences, which makes them more useful in practical applications that require long contexts.
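To make the underlying limitation concrete, here is a minimal sketch of why a RoPE model sees unfamiliar inputs past its training length, and how rescaling positions avoids that. It is not YaRN itself: the rescaling shown is plain linear position interpolation, a simpler earlier baseline. The function names, the base of 10000, the head dimension, and the 4096 and 16384 window sizes are all illustrative assumptions, not values taken from this summary.

```python
import math

def rope_angles(position, head_dim, base=10000.0):
    """Rotation angles RoPE applies at one position (one angle per dimension pair).

    Assumes the standard RoPE frequencies theta_i = base ** (-2i / head_dim).
    """
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def interpolated_position(position, trained_len, target_len):
    """Linear position interpolation (NOT YaRN): squeeze positions from
    [0, target_len) into the trained range [0, trained_len)."""
    return position * (trained_len / target_len)

# Hypothetical sizes chosen for illustration only.
trained_len, target_len, head_dim = 4096, 16384, 128

# Largest angle the fastest-rotating dimension pair saw during pre-training.
max_trained_angle = rope_angles(trained_len - 1, head_dim)[0]

# A position beyond the trained window produces an angle the model never saw.
raw = rope_angles(10000, head_dim)

# After rescaling, the same token maps back inside the trained range.
scaled = rope_angles(interpolated_position(10000, trained_len, target_len), head_dim)

print(raw[0] > max_trained_angle)      # out of range without rescaling
print(scaled[0] <= max_trained_angle)  # back in range with rescaling
```

Uniform rescaling of this kind compresses every frequency by the same factor. The efficiency gains reported in the abstract are claims about YaRN, not about this baseline.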