YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

2023-11-02

Abstract: Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online up to 128k context length at https://github.com/jquesnelle/yarn

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to extend the context window of large language models that use Rotary Position Embedding (RoPE), so that they can handle texts longer than those seen during pre-training. Existing RoPE-based models (such as LLaMA, GPT-NeoX, and PaLM) struggle with long sequences because their context window is fixed by the sequence length used in pre-training. The paper proposes YaRN (Yet another RoPE extensioN method), which aims to extend the context window of these models with minimal fine-tuning, or even without fine-tuning.

The main contributions of YaRN include:

1. **Efficient context window extension**: YaRN requires 10x fewer tokens and 2.5x fewer training steps than previous methods while achieving better results.
2. **Surpassing existing methods**: Experimental results show that LLaMA models whose context window is extended with YaRN handle long sequences better than models extended with other existing methods.
3. **Generalization ability**: YaRN performs well within the limited context length of the fine-tuning dataset and also extrapolates to longer contexts not seen during fine-tuning.

In summary, YaRN addresses the difficulty that existing RoPE-based large language models have with sequences longer than their pre-training context, which makes them more useful for long-text applications.
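To make the setting concrete, here is a minimal sketch of how RoPE rotates query and key vectors by position. It assumes the standard RoPE formulation with a frequency base of 10000, which this summary does not state. The `scale` parameter illustrates plain linear position scaling, a simpler approach than YaRN; it is not YaRN's own scheme, which this summary does not specify. All function names are hypothetical.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at `position` for a head of size `dim`.

    Assumption: standard RoPE frequencies base**(-2i/dim).
    scale > 1 compresses positions (plain linear scaling, NOT YaRN).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [(position / scale) * base ** (-2.0 * i / dim)
            for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its angle."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative vectors (made up for this example).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# The score depends only on the offset between positions (here 7).
score_near = dot(apply_rope(q, 10), apply_rope(k, 3))
score_far = dot(apply_rope(q, 110), apply_rope(k, 103))
print(math.isclose(score_near, score_far, abs_tol=1e-9))

# With scale=4, position 8000 reuses the angles of unscaled position 2000.
scaled = rope_angles(8000, 8, scale=4.0)
unscaled = rope_angles(2000, 8)
print(all(math.isclose(a, b) for a, b in zip(scaled, unscaled)))
```

The second check shows the basic idea behind context extension by rescaling: positions beyond the pre-training length are mapped onto rotation angles the model has already seen, at the cost of packing positions more densely.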

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Extending LLMs' Context Window with 100 Samples

Extending Context Window of Large Language Models from a Distributional Perspective

Resonance RoPE: Improving Context Length Generalization of Large Language Models

LongEmbed: Extending Embedding Models for Long Context Retrieval

E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

Why Does the Effective Context Length of LLMs Fall Short?

Long-Context Language Modeling with Parallel Context Encoding

Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Scaling Laws of RoPE-based Extrapolation

HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position

An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding

Exploring Context Window of Large Language Models via Decomposed Positional Vectors

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Long Context RAG Performance of Large Language Models

CLEX: Continuous Length Extrapolation for Large Language Models

Parallel Context Windows for Large Language Models

Base of RoPE Bounds Context Length

Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization