Exploring Context Window of Large Language Models via Decomposed Positional Vectors

Zican Dong, Junyi Li, Xin Men, Wayne Xin Zhao, Bingbing Wang, Zhen Tian, Weipeng Chen, Ji-Rong Wen
2024-05-28
Abstract: Transformer-based large language models (LLMs) typically have a limited context window, resulting in significant performance degradation when processing text beyond the length of the context window. Extensive studies have been proposed to extend the context window and achieve length extrapolation of LLMs, but there is still a lack of in-depth interpretation of these approaches. In this study, we explore the positional information within and beyond the context window for deciphering the underlying mechanism of LLMs. By using a mean-based decomposition method, we disentangle positional vectors from hidden states of LLMs and analyze their formation and effect on attention. Furthermore, when texts exceed the context window, we analyze the change of positional vectors in two settings, i.e., direct extrapolation and context window extension. Based on our findings, we design two training-free context window extension methods, positional vector replacement and attention window extension. Experimental results show that our methods can effectively extend the context window length.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is: **the performance of Transformer-based large language models (LLMs) drops significantly when they process text longer than their context window**. Specifically, when the input sequence exceeds the maximum length seen during training (i.e., the context window size), the positional information becomes out-of-distribution (OOD), and model performance degrades sharply, for example as a steep rise in perplexity (PPL).

### Main Problem Analysis

1. **Limitations of the Context Window**:
   - Transformer-based LLMs usually have a limited context window, so their performance drops substantially on text longer than this window.
   - When the input sequence exceeds the context window, the positional information becomes out-of-distribution (OOD), and the model can no longer process it effectively.
2. **Deficiencies of Existing Methods**:
   - Previous research has proposed various methods for extending the context window, but most of them focus on adjusting the positional encodings or the attention scores and lack an in-depth analysis of the internal mechanism of the hidden states.

### Solutions

To address these problems, the paper proceeds as follows:

- **Decomposing the Positional Vector**:
  - Use a mean-based decomposition method to separate positional vectors from hidden states, then analyze how they form and how they affect the attention mechanism.
- **Changes in the Positional Vector Beyond the Context Window**:
  - Analyze how positional vectors change in two settings: direct extrapolation and context window extension.
- **Two Training-Free Context Window Extension Methods**:
  - **Positional Vector Replacement**: Replace the original positional vectors with interpolated positional vectors to avoid the OOD problem.
  - **Attention Window Extension**: Enlarge the attention window to control the formation of positional vectors, and use a scaling factor λ to adjust the attention scores.

### Experimental Results

The experimental results show that both methods effectively extend the context window length without any additional fine-tuning. In particular, on longer texts they maintain performance comparable to existing methods.

### Summary

The main contributions of this paper are:

1. It clarifies how positional vectors form and how they affect long-term decay and attention sinks.
2. It unifies, for the first time, length extrapolation and context window extension from the perspective of positional vectors, and it identifies preventing OOD positional vectors as the key to avoiding performance degradation.
3. It proposes two training-free context window extension methods, and the experimental results verify their effectiveness.

Through these studies, the paper provides new perspectives and methods for understanding and improving how large language models handle long texts.
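
The mean-based decomposition and the idea behind positional vector replacement can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes the positional vector at a position is the mean of one layer's hidden states over many samples at that position, with the remainder treated as the semantic part, and it stands in plain linear interpolation for the interpolation step. The function names `decompose_hidden_states` and `interpolate_positional_vectors` and the array layout are hypothetical.

```python
import numpy as np


def decompose_hidden_states(hidden):
    """Split hidden states into positional and semantic parts.

    hidden: array of shape (num_samples, seq_len, dim) holding one layer's
    hidden states for many samples (assumed layout, for illustration only).
    """
    # Assumption: the positional vector at position t is the mean hidden
    # state over all samples at that position.
    positional = hidden.mean(axis=0)              # (seq_len, dim)
    # Whatever remains after removing the positional part is treated as
    # the sample-specific (semantic) component.
    semantic = hidden - positional[None, :, :]    # (num_samples, seq_len, dim)
    return positional, semantic


def interpolate_positional_vectors(positional, target_len):
    """Stretch in-window positional vectors over target_len positions.

    Uses per-dimension linear interpolation as a stand-in; the paper's
    exact interpolation and any rescaling may differ.
    """
    src_len, dim = positional.shape
    src_pos = np.arange(src_len)
    # Map the target positions back into the original position range.
    tgt_pos = np.linspace(0, src_len - 1, target_len)
    columns = [np.interp(tgt_pos, src_pos, positional[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)              # (target_len, dim)
```

In this sketch every interpolated vector lies between two vectors observed inside the original window, which mirrors the summary's point that avoiding OOD positional vectors is what prevents degradation. Where in the model such vectors would be swapped in is left out here.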