Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
2024-10-29
Abstract: Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate RoPE trained on comparatively short texts to far longer texts. Considerable effort has been dedicated to boosting extrapolation by extending the formulation of RoPE; however, few of these works have attempted to showcase their inner workings comprehensively. In this paper, we are driven to offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to improve the ability of large language models (LLMs) to handle long-text contexts. Most LLMs encode position information with Rotary Position Embedding (RoPE), but RoPE performance declines significantly once the input exceeds the pre-training length. Researchers therefore try to improve long-text performance by extending the RoPE formulation. The main contributions of the paper are:
1. It systematically analyzes common RoPE extension methods from the perspective of the attention mechanism and finds that their effectiveness mainly comes from maintaining the attention pattern observed at the pre-training length.
2. It uses the more challenging "Needle-in-a-Haystack" test to analyze these methods further and observes that, although RoPE extensions can improve long-text processing, retrieval errors still occur where attention uncertainty is high.
3. It hypothesizes that high attention uncertainty is caused by the mismatch between context lengths at training and at inference, and proposes reducing this uncertainty through continual pretraining on longer contexts, which enhances the model's ability to handle long texts.
Through these studies, the paper offers insights for understanding and optimizing RoPE extension methods.
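The two ideas summarized above, rescaling RoPE positions and measuring attention uncertainty, can be illustrated with a minimal sketch. It rests on assumptions that are not stated in this summary: the RoPE extension shown is plain linear position interpolation, uncertainty is taken to be the Shannon entropy of one attention distribution, and the names `rope_angles`, `scale`, and `attention_entropy` are hypothetical helpers rather than the paper's code.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at one position.

    Assumption: `scale` > 1 mimics linear position interpolation, which
    divides the position index so long inputs map back into the range of
    positions seen during pre-training.
    """
    # Standard RoPE frequencies: theta_i = base^(-2i/dim), i in [0, dim/2).
    return [(position / scale) * base ** (-2.0 * i / dim)
            for i in range(dim // 2)]


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Assumption: entropy is used here as a stand-in for "attention
    uncertainty"; higher values mean attention is spread more evenly.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# With scale=2, position 4096 gets the same angles as position 2048 unscaled.
interpolated = rope_angles(4096, dim=8, scale=2.0)
pretrained = rope_angles(2048, dim=8)

# A peaked attention row is less uncertain than a uniform one.
peaked = attention_entropy(softmax([10.0, 0.0, 0.0, 0.0]))
uniform = attention_entropy(softmax([0.0, 0.0, 0.0, 0.0]))
```

In this toy setting, `interpolated` equals `pretrained`, and `peaked` is smaller than `uniform`, whose value is `math.log(4)`. A real analysis would compute entropy per head and per layer from a model's attention weights.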