Abstract: Efforts to extend the context length of Transformer-based large language models (LLMs) and improve their comprehension capabilities are often limited by computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs within constrained storage space. We also investigate the issue of poor model responses when both instructions and context are compressed in downstream tasks, and propose an instruction reconstruction method to mitigate this problem. We validated the effectiveness of our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M. Finally, our method demonstrated competitive performance in long-text question-answering tasks compared to non-compressed methods, while significantly saving storage resources in long-text inference tasks. Our code, models, and demo are available at https://github.com/WUHU-G/RCC_Transformer

What problem does this paper attempt to address?

This paper addresses how to expand the context window of Transformer-based large language models (LLMs), and improve their comprehension of long inputs, within limited storage space. Existing approaches are constrained by computing resources and memory capacity. The paper proposes Recurrent Context Compression (RCC), which compresses the context through an autoencoder structure to reduce information loss and improve compression efficiency. It also proposes a training method for the long-text context compression language model in which a recurrent compression mechanism overcomes the context window limitation of the encoder.

Experiments show a compression ratio of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M. RCC also performs competitively with non-compressed methods on long-text question-answering tasks while significantly reducing the storage needed for long-text inference.

In downstream tasks, when both the instructions and the context are compressed, the model often fails to follow the instructions correctly, which degrades response quality. To address this, the paper uses the text reconstruction capability of the context compression language model to reconstruct the instructions, which significantly improves output quality in that setting. In summary, the paper targets the efficiency, scalability, and instruction confusion problems of existing context compression methods for long-text inputs.
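The window-by-window compression described above can be illustrated with a small bookkeeping sketch. This is not the paper's implementation: the names `compress_chunk` and `recurrent_compress` are hypothetical, mean pooling stands in for the learned encoder, and the window size of 2048 is an arbitrary choice for the example. The sketch also does not model any state a real encoder might carry from one window to the next; it only shows how a sequence longer than the encoder window can be reduced by a fixed rate, one window at a time.

```python
from typing import List

Vector = List[float]


def compress_chunk(chunk: List[Vector], rate: int) -> List[Vector]:
    """Placeholder compressor (assumption): mean-pool each run of `rate`
    consecutive vectors into one. A learned encoder would go here."""
    compressed = []
    for start in range(0, len(chunk), rate):
        group = chunk[start:start + rate]
        dim = len(group[0])
        compressed.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return compressed


def recurrent_compress(tokens: List[Vector], window: int, rate: int) -> List[Vector]:
    """Walk over the sequence one encoder window at a time, compress each
    window, and append the result to a growing compressed memory."""
    memory: List[Vector] = []
    for start in range(0, len(tokens), window):
        memory.extend(compress_chunk(tokens[start:start + window], rate))
    return memory


if __name__ == "__main__":
    # Toy "embeddings": 4096 two-dimensional vectors.
    sequence = [[float(i), 1.0] for i in range(4096)]
    memory = recurrent_compress(sequence, window=2048, rate=32)
    print(len(sequence), "->", len(memory))  # prints "4096 -> 128"
```

With the same arithmetic, a 1M-token input compressed at 32x would leave roughly 31,250 vectors to store, which is where the storage saving for long-text inference comes from.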

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

Extending Context Window of Large Language Models via Semantic Compression

In-Context Former: Lightning-fast Compressing Context for Large Language Model

Long Context Compression with Activation Beacon

Context Compression for Auto-regressive Transformers with Sentinel Tokens

Adapting Language Models to Compress Contexts

Compressed Context Memory For Online Language Model Interaction

Context Compression and Extraction: Efficiency Inference of Large Language Models

Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

LoCoCo: Dropping In Convolutions for Long Context Compression

Training-Free Exponential Context Extension via Cascading KV Cache

Efficient Long Context Language Model Retrieval with Compression

KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

Efficient Large Multi-modal Models via Visual Context Compression

LLoCO: Learning Long Contexts Offline

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

Corner-to-Center Long-range Context Model for Efficient Learned Image Compression