Empower Your Model with Longer and Better Context Comprehension

Yifei Gao, Lei Wang, Jun Fang, Longhua Hu, Jun Cheng
2023-07-27
Abstract: Recently, with the emergence of numerous Large Language Models (LLMs), the implementation of AI has entered a new era. Irrespective of these models' own capacity and structure, there is a growing demand for LLMs to possess enhanced comprehension of longer and more complex contexts at relatively smaller sizes. Models often hit an upper limit when processing sequences that extend beyond their comprehension capacity, resulting in off-topic or even chaotic responses. While several recent works attempt to address this issue in various ways, they rarely focus on "why models are unable to compensate or strengthen their capabilities on their own". In this paper, we thoroughly investigate the nature of information transfer within LLMs and propose a novel technique called Attention Transition. This technique empowers models to achieve longer and better context comprehension with minimal additional training or impact on generation fluency. Our experiments are conducted on the challenging XSum dataset using the LLaMa-7b model with context token lengths ranging from 800 to 1900. Results, evaluated by GPT4, demonstrate substantial improvements over the original generation results.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper focuses on the limitations of large language models (LLMs) in handling long and complex contexts. Despite the impressive performance of existing LLMs, they may produce irrelevant or confusing responses when asked to understand sequences beyond a certain length. The authors note that although previous work has attempted to address this issue, few studies have investigated why the models themselves fail to compensate for or strengthen their own capabilities. In response, the paper proposes an approach called "Attention Transition", which enables a model to understand and process longer contexts with minimal additional training and little impact on generation fluency.
The paper also explores the influence of attention weights on text generation and the role of rotary position embeddings in information propagation. The authors find that while rotary position embeddings attenuate the contribution of distant tokens, this same decay restricts the model's understanding of long-distance information. Their proposed method strengthens inter-layer information propagation by eliminating unimportant attention weights and reallocating that weight to important information.
Experiments are conducted on the challenging XSum dataset with the LLaMa-7b model. They show that Attention Transition improves generated results over the original model across different context lengths, especially for longer sequences, and they suggest strong generalization potential for current LLMs. The paper also conducts ablation studies to analyze the impact of parameter choices and points out that applying Attention Transition too aggressively may lower generation quality.
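The idea of eliminating unimportant attention weights and reallocating them to important information can be illustrated with a small sketch. This is not the authors' implementation: the function name `reallocate_attention`, the fixed `threshold` cutoff, and the proportional redistribution rule are all assumptions chosen for illustration, and the paper may select and reweight attention entries differently.

```python
def reallocate_attention(weights, threshold):
    """Zero out weights below `threshold` and rescale the survivors.

    Hypothetical illustration only: the cutoff rule and the proportional
    rescaling are assumptions, not the method described in the paper.
    """
    total = sum(weights)
    # Keep weights at or above the cutoff; drop the rest.
    kept = [w if w >= threshold else 0.0 for w in weights]
    kept_total = sum(kept)
    if kept_total == 0.0:
        # Nothing survived the cutoff, so leave the row unchanged.
        return list(weights)
    # Rescale so the row keeps its original total mass.
    scale = total / kept_total
    return [w * scale for w in kept]


# One row of attention weights over five context tokens (sums to 1).
row = [0.50, 0.30, 0.02, 0.03, 0.15]
new_row = reallocate_attention(row, threshold=0.05)
```

In this sketch the two small entries are removed and their mass is spread over the remaining three in proportion to their original size, so the row still sums to one. A cutoff that is too high would discard useful context, which is in line with the summary's remark that overly aggressive use can reduce generation quality.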
In summary, the problem this paper attempts to address is how to enable large language models to understand and process longer, more complex contexts effectively, even at relatively small model sizes. With the Attention Transition technique, the authors offer a solution that requires minimal additional training and enhances the model's contextual understanding.