Resonance RoPE: Improving Context Length Generalization of Large Language Models

Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, Bang Liu
2024-06-10
Abstract: This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the "train-short-test-long" (TSTL) problem in large language models (LLMs): models are pre-trained on relatively short text sequences but must handle longer sequences in practice, and their performance declines at token positions outside the pre-training range (out-of-distribution, OOD, positions).

### Main problems

1. **Extrapolation problem of position embedding**: With the standard Rotary Position Embedding (RoPE), positions beyond the pre-training sequence length require the model to extrapolate to feature values it has not seen, which degrades long-text generation and understanding.
2. **Interpolation problem**: Even for RoPE features whose values stay within the range covered during pre-training, OOD positions can produce values that were never observed exactly, so the model has to interpolate between seen values, which can also hurt generalization.

### Solutions

To address these problems, the paper proposes the following:

1. **Resonance RoPE**:
   - **Optimized interpolation**: The wavelength of each RoPE feature is rounded to the nearest integer. With integer wavelengths, the feature values at OOD positions repeat values already seen at training positions (they "resonate"), which reduces the interpolation error at OOD positions.
   - **No additional computational cost**: The rounded wavelengths are computed offline, so Resonance RoPE adds no overhead during online inference.
2. **PosGen benchmark**:
   - **New synthetic benchmark**: To analyze model behavior in the TSTL scenario at a finer grain, the paper introduces the PosGen benchmark. It aims to isolate the difficulty of long-context generation from the challenge of recognizing new positions. Because the difficulty of generating each token is kept constant throughout the sequence, any observed failure can be attributed to the model's ability to recognize new positions.

### Experimental verification

The paper evaluates Resonance RoPE and PosGen through the following experiments:

1. **Synthetic task evaluation**:
   - **PosGen subtasks**: Experiments cover three PosGen subtasks: the recursive task, the chain-of-thought task, and the semi-recursive task. The results show that Resonance RoPE and Resonance YaRN perform significantly better than their non-Resonance counterparts at OOD positions.
   - **Validation loss curve**: Validation loss curves for the different position embedding methods during training further support the effectiveness of the Resonance technique.
2. **Large language model evaluation**:
   - **LLaMA2-Chat**: Fine-tuning experiments on the LLaMA2-Chat model compare position embedding methods on long-text tasks. The results show that Resonance YaRN performs well on multiple downstream tasks, especially when dealing with OOD positions.

### Conclusion

With Resonance RoPE and the PosGen benchmark, the paper mitigates the performance degradation of large language models on long texts in the TSTL scenario and provides new tools and methods for future research.
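
The wavelength rounding that Resonance RoPE applies can be illustrated with a minimal sketch. It assumes the standard RoPE frequencies `theta_i = base ** (-2i / dim)` with `base = 10000`; the function names and the head dimension of 64 are hypothetical choices for illustration, not taken from the paper's code.

```python
import math

def rope_wavelengths(dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base ** (-2i / dim), i = 0 .. dim/2 - 1.
    # The wavelength of feature i (in tokens) is 2*pi / theta_i.
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

def resonance_frequencies(dim, base=10000.0):
    # Round every wavelength to the nearest integer (at least 1 token),
    # then convert back to an angular frequency. This is done once, offline.
    return [2 * math.pi / max(1, round(lam)) for lam in rope_wavelengths(dim, base)]

def rope_angle_features(position, freqs):
    # (cos, sin) pair of each feature at a single token position.
    return [(math.cos(position * f), math.sin(position * f)) for f in freqs]

if __name__ == "__main__":
    dim = 64  # assumed head dimension, for illustration only
    freqs = resonance_frequencies(dim)
    wavelength = round(2 * math.pi / freqs[0])  # integer wavelength of feature 0
    seen = rope_angle_features(3, freqs)[0]
    far = rope_angle_features(3 + 1000 * wavelength, freqs)[0]
    # Up to floating-point error, the far position reproduces the value seen at position 3.
    print(wavelength, seen, far)
```

Because each rounded wavelength is an integer number of tokens, a feature whose wavelength is shorter than the training length takes, at any later position, a value it already took at some training position. With the original non-integer wavelengths, integer positions never repeat a value exactly.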