Abstract: Transformer-based large language models (LLMs) typically have a limited context window, resulting in significant performance degradation when processing text beyond the length of the context window. Extensive studies have been proposed to extend the context window and achieve length extrapolation of LLMs, but there is still a lack of in-depth interpretation of these approaches. In this study, we explore the positional information within and beyond the context window for deciphering the underlying mechanism of LLMs. By using a mean-based decomposition method, we disentangle positional vectors from hidden states of LLMs and analyze their formation and effect on attention. Furthermore, when texts exceed the context window, we analyze the change of positional vectors in two settings, i.e., direct extrapolation and context window extension. Based on our findings, we design two training-free context window extension methods, positional vector replacement and attention window extension. Experimental results show that our methods can effectively extend the context window length.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **the performance of Transformer-based large language models (LLMs) drops significantly when they process text that exceeds the length of their context window**. Specifically, when the input sequence exceeds the maximum length of the training data (i.e., the context window size), the position encoding becomes out-of-distribution (OOD), and model performance degrades sharply, for example as a steep increase in the perplexity (PPL) score.

### Main Problem Analysis

1. **Limitations of the Context Window**:
   - Transformer-based LLMs usually have a limited context window, so their performance drops substantially on text longer than this window.
   - When the input sequence is longer than the context window, the position encoding becomes out-of-distribution (OOD), and the model can no longer process this positional information effectively.
2. **Deficiencies of Existing Methods**:
   - Previous research has proposed various methods for extending the context window, but most of them focus on adjusting the position encoding or the attention scores and lack an in-depth analysis of the internal mechanism of the hidden states.

### Solutions

To address these problems, the paper takes the following steps:

- **Decomposing the Positional Vector**:
  - Use a mean-based decomposition method to separate the positional vector from the hidden state, then analyze how it forms and how it affects the attention mechanism.
- **Changes in the Positional Vector Beyond the Context Window**:
  - Analyze how the positional vector changes in two settings: direct extrapolation and context window extension.
- **Two Training-Free Context Window Extension Methods**:
  - **Positional Vector Replacement**: Replace the original positional vector with an interpolated positional vector to avoid the OOD problem.
  - **Attention Window Extension**: Enlarge the attention window to control the formation of the positional vector, and use a scaling factor λ to adjust the attention scores.

### Experimental Results

The experimental results show that both methods can effectively extend the context window length without additional fine-tuning. In particular, on longer texts they maintain performance comparable to existing methods.

### Summary

The main contributions of this paper are:

1. It clarifies how the positional vector forms and how it affects long-term decay and attention sinks.
2. It is the first to unify length extrapolation and context window extension from the perspective of the positional vector, and it identifies preventing OOD positional vectors as the key to avoiding performance degradation.
3. It proposes two training-free context window extension methods, and the experimental results verify their effectiveness.

Through these studies, the paper provides new perspectives and methods for understanding and improving the performance of large language models on long texts.
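
The mean-based decomposition and the interpolation step mentioned under Solutions can be illustrated with a minimal sketch. It rests on assumptions not spelled out in this summary: the positional vector at position `t` is taken to be the mean of the hidden states at that position across many samples, the remainder is treated as the content-specific part, and the interpolation is plain linear interpolation along the position axis. The function names `decompose` and `interpolate_positional` are hypothetical, and plain Python lists stand in for real model hidden states to keep the example dependency-free.

```python
from statistics import fmean


def decompose(hidden):
    """Split hidden states into a positional part and a per-sample remainder.

    hidden[s][t] is the hidden-state vector (list of floats) of sample s at
    position t, for one fixed layer. Assumption: the positional vector at
    position t is the mean over samples of the hidden states at t.
    """
    num_positions = len(hidden[0])
    dim = len(hidden[0][0])
    # positional[t][k]: average of component k at position t over all samples
    positional = [
        [fmean(sample[t][k] for sample in hidden) for k in range(dim)]
        for t in range(num_positions)
    ]
    # remainder[s][t] = hidden[s][t] - positional[t]
    remainder = [
        [[h - p for h, p in zip(sample[t], positional[t])]
         for t in range(num_positions)]
        for sample in hidden
    ]
    return positional, remainder


def interpolate_positional(positional, new_length):
    """Stretch positional vectors to new_length positions by linear interpolation.

    Assumption: simple linear interpolation along the position axis; the
    summary does not specify the exact interpolation scheme.
    """
    old_length = len(positional)
    if old_length < 2 or new_length < 2:
        raise ValueError("need at least two positions on both sides")
    stretched = []
    for i in range(new_length):
        # Map the new index i back onto the original position axis.
        x = i * (old_length - 1) / (new_length - 1)
        lo = int(x)
        hi = min(lo + 1, old_length - 1)
        w = x - lo
        stretched.append(
            [(1 - w) * a + w * b for a, b in zip(positional[lo], positional[hi])]
        )
    return stretched


if __name__ == "__main__":
    # Two toy samples, two positions, two hidden dimensions.
    hidden = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[3.0, 4.0], [5.0, 8.0]],
    ]
    positional, remainder = decompose(hidden)
    print(positional)
    print(interpolate_positional(positional, 3))
```

Every stretched vector here is a convex combination of vectors observed inside the original window, which is one way to read the summary's point that interpolated positional vectors avoid the OOD problem. How the replacement is wired into an actual model (which layers, any rescaling) is not covered by this sketch.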

Exploring Context Window of Large Language Models via Decomposed Positional Vectors
