Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang

2024-10-29

Abstract: Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparably short texts to far longer texts. Considerable effort has been dedicated to boosting extrapolation by extending the formulation of RoPE; however, few works have attempted to showcase the inner workings of these extensions comprehensively. In this paper, we offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how to improve the ability of large language models (LLMs) to handle long-text contexts. Most LLMs encode position information with Rotary Position Embedding (RoPE), but performance degrades significantly when the input exceeds the pre-training length. Researchers therefore extend the RoPE formulation to improve performance on long texts. The main contributions of the paper are:

1. It systematically analyzes common RoPE extension methods from the perspective of the attention mechanism and finds that their effectiveness mainly comes from maintaining the attention patterns seen at the pre-training length.
2. It further analyzes these methods on the more challenging Needle-in-a-Haystack test and observes that, although RoPE extensions can improve long-text processing, they still fail to extrapolate where attention uncertainty is high, which leads to retrieval errors.
3. It hypothesizes that high attention uncertainty is caused by the mismatch between training and inference context lengths, and proposes reducing this uncertainty through continual pre-training on longer contexts, thereby enhancing the model's ability to handle long texts.

Through these studies, the paper provides insights for understanding and optimizing RoPE extension methods.
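To make the two central notions concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes the standard RoPE formulation with base 10000, uses linear position interpolation (dividing positions by a scale factor) as one illustrative example of a RoPE extension, and assumes that "attention uncertainty" can be measured as the Shannon entropy of the attention distribution. The function names `rope_rotate`, `attention_weights`, and `attention_entropy` are hypothetical and chosen only for this illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0, scale=1.0):
    """Apply rotary position embedding to a single even-length vector.

    Assumption: scale > 1 mimics linear position interpolation, i.e. the
    position is divided by `scale` so that long inputs map back into the
    position range seen during pre-training.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    pos = position / scale
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of dimensions rotates at its own frequency.
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_weights(query, keys):
    """Softmax over scaled dot products of one query with a list of keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention distribution.

    Assumption: higher entropy is read as higher attention uncertainty.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


if __name__ == "__main__":
    q = [0.3, -1.2, 0.5, 0.8]
    keys = [[1.0, 0.4, -0.7, 0.2], [0.1, 0.9, 0.3, -0.5], [0.6, -0.2, 0.8, 0.4]]
    q_rot = rope_rotate(q, position=len(keys))
    k_rot = [rope_rotate(k, position=p) for p, k in enumerate(keys)]
    w = attention_weights(q_rot, k_rot)
    print("attention weights:", w)
    print("attention entropy:", attention_entropy(w))
```

In this sketch, a sharply peaked attention distribution has entropy near zero, while a uniform distribution over n keys has the maximum entropy log(n). Comparing entropies with and without `scale` at positions beyond the pre-training length is one simple way to probe the kind of uncertainty the summary describes.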

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

On the token distance modeling ability of higher RoPE attention dimension

Extending LLMs' Context Window with 100 Samples

Scaling Laws of RoPE-based Extrapolation

When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Base of RoPE Bounds Context Length

A Controlled Study on Long Context Extension and Generalization in LLMs

HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position

LongEmbed: Extending Embedding Models for Long Context Retrieval

Extending Context Window of Large Language Models from a Distributional Perspective

3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Resonance RoPE: Improving Context Length Generalization of Large Language Models

Mixture of In-Context Experts Enhance LLMs' Long Context Awareness

HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

YaRN: Efficient Context Window Extension of Large Language Models

ReAttention: Training-Free Infinite Context with Finite Attention Scope