Abstract: Recently, with the emergence of numerous Large Language Models (LLMs), the implementation of AI has entered a new era. Irrespective of these models' own capacity and structure, there is a growing demand for LLMs to possess enhanced comprehension of longer and more complex contexts at relatively small model sizes. Models often hit an upper limit when processing sequences of sentences that extend beyond their comprehension capacity, which results in off-topic or even chaotic responses. While several recent works attempt to address this issue in various ways, they rarely focus on "why models are unable to compensate or strengthen their capabilities on their own". In this paper, we thoroughly investigate the nature of information transfer within LLMs and propose a novel technique called Attention Transition. This technique empowers models to achieve longer and better context comprehension with minimal additional training or impact on generation fluency. Our experiments are conducted on the challenging XSum dataset using the LLaMa-7b model, with context lengths ranging from 800 to 1900 tokens. Results demonstrate substantial improvements over the original generation results, as evaluated by GPT4.

What problem does this paper attempt to address?

This paper addresses the limitations of large language models (LLMs) in handling long and complex contexts. Despite their strong performance, existing LLMs may produce irrelevant or confused responses once the input exceeds a certain length. The authors note that although previous work has attempted to address this issue, few studies have investigated why the models fail to compensate for or strengthen their capabilities on their own.

To address this, the paper proposes an approach called "Attention Transition", which enables the model to understand and process longer contexts with minimal additional training and without harming generation fluency. The paper also examines how attention weights influence text generation and what role rotary position embeddings play in information propagation. The authors find that although rotary embeddings attenuate information from distant tokens, this attenuation also restricts the model's understanding of long-distance information. The proposed method enhances inter-layer information propagation by removing attention weight from unimportant information and reallocating it to important information.

Experiments use the challenging XSum dataset with the LLaMa-7b model. According to the summary, the model's generated results improve over the original results across different context lengths once Attention Transition is applied, especially on longer sequences, which suggests the technique may generalize to other current LLMs. The paper also conducts ablation studies on the effect of parameter choices and points out that excessive attention extension may reduce generation quality.

In summary, the problem this paper attempts to address is how to enable large language models to understand and process long, complex contexts effectively even at relatively small model sizes. With the Attention Transition technique, the authors provide a solution that requires minimal additional training and enhances the model's contextual understanding.
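
The reallocation idea described above (dropping unimportant attention weights and giving their mass to important ones) can be illustrated with a small sketch. This is not the authors' implementation: the summary does not say how "unimportant" weights are selected or how the freed mass is redistributed. The sketch assumes a top-k selection by magnitude and a proportional redistribution, and the name `reallocate_attention` and the parameter `keep_ratio` are hypothetical.

```python
def reallocate_attention(weights, keep_ratio=0.5):
    """Zero the smallest weights in one attention row and rescale the rest.

    Assumptions (not taken from the paper): importance is measured by
    weight magnitude, and removed mass is shared proportionally among
    the kept weights so the row keeps its original sum.
    """
    if not weights:
        return []
    n_keep = max(1, int(len(weights) * keep_ratio))
    # Indices of the n_keep largest weights; ties are broken by position.
    order = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    keep = set(order[:n_keep])
    kept_mass = sum(weights[i] for i in keep)
    if kept_mass == 0:
        return list(weights)  # nothing to rescale
    scale = sum(weights) / kept_mass
    return [w * scale if i in keep else 0.0 for i, w in enumerate(weights)]


# One attention row over six context tokens (already softmax-normalized).
row = [0.05, 0.40, 0.05, 0.30, 0.10, 0.10]
new_row = reallocate_attention(row, keep_ratio=0.5)
```

In this toy row, the three smallest entries are zeroed and the remaining three are scaled up so the row still sums to 1. A real model would apply such a step per head and per layer on the attention matrix, and the choice of `keep_ratio` corresponds loosely to the kind of parameter the ablation studies are said to examine.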

Empower Your Model with Longer and Better Context Comprehension

Extending Context Window of Large Language Models via Semantic Compression

A Controlled Study on Long Context Extension and Generalization in LLMs

Long-Context Language Modeling with Parallel Context Encoding

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

LooGLE: Can Long-Context Language Models Understand Long Contexts?

X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling

Large Language Models Can Self-Improve in Long-context Reasoning

Retrieval meets Long Context Large Language Models

How to Train Long-Context Language Models (Effectively)

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

Exploring Context Window of Large Language Models via Decomposed Positional Vectors

Can Large Language Models Understand Context?

Two are better than one: Context window extension with multi-grained self-injection

UniMem: Towards a Unified View of Long-Context Large Language Models

Training With "Paraphrasing the Original Text" Improves Long-Context Performance

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

ACER: Automatic Language Model Context Extension via Retrieval

Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

Why Does the Effective Context Length of LLMs Fall Short?