Abstract: Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational costs or requiring fine-tuning. Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses the limitations that large language models (LLMs) face when handling long text inputs. Current LLMs typically cap the length of the input so that generation stays fluent and relevant, and this cap makes them perform poorly on long documents such as scientific papers, novels, and legal contracts. To overcome it, the authors propose a semantic compression method that lets an LLM handle texts 6-8 times longer than it otherwise could, without significantly increasing computational cost and without fine-tuning.
### Background and Motivation
1. **Limitations of Existing LLMs**:
- Current LLMs experience a sharp decline in performance when the input context exceeds a certain length.
- This limitation mainly stems from the self-attention mechanism in Transformer models, whose computational complexity grows quadratically with sequence length, leading to enormous memory and time consumption when processing long texts.
- Training new models to accommodate longer input sequences is costly. Some existing methods, such as position encoding interpolation, partially address the issue but still require substantial time and GPU resources.
2. **Inspiration for Semantic Compression**:
- The authors draw inspiration from the concept of source coding in information theory, using pre-trained models to reduce semantic redundancy in long inputs.
- By compressing long texts, the input length can be significantly shortened while preserving semantic meaning, thereby extending the context window of LLMs.
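
The quadratic growth of self-attention and the effect of shortening the input can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the 6x compression ratio are assumptions chosen to match the 6-8x range quoted above, not values taken from the paper's implementation.

```python
import math


def attention_score_entries(seq_len: int) -> int:
    """Entries in one self-attention score matrix, which is seq_len x seq_len."""
    return seq_len * seq_len


def compressed_length(seq_len: int, ratio: float) -> int:
    """Token count after compression by `ratio` (hypothetical; 6.0 means 6x shorter)."""
    return math.ceil(seq_len / ratio)


window = 4096            # context window size reported in this summary
long_input = 6 * window  # an input six times longer than the window

# Doubling the sequence length quadruples the size of the score matrix.
growth = attention_score_entries(2 * window) // attention_score_entries(window)

# With an assumed 6x compression ratio, the long input fits the window again.
fits = compressed_length(long_input, 6.0) <= window
```

Because the cost is paid on the compressed length rather than the original one, shortening the input before it reaches the LLM avoids the quadratic blow-up without changing the model itself.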
### Method Overview
1. **Semantic Compression Framework**:
- The method first segments the input text into theme-based chunks and then uses a pre-trained model to compress each chunk.
- The compressed text chunks are then recombined to form a simplified input for the LLMs to process.
2. **Technical Details**:
- **Text Segmentation**: A weighted graph representing the input text is constructed, and clustering algorithms are used to identify different thematic structures.
- **Chunk Compression**: Each thematic chunk is processed independently, using a pre-trained summarization model to compress and retain key information.
- **Result Merging**: The compressed chunks are recombined in their original order to form the final compressed text.
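
The three steps above can be sketched as a small pipeline. This is a minimal illustration under loud assumptions, not the authors' implementation: the bag-of-words similarity and the threshold-based split stand in for the weighted graph and clustering step, and `first_sentence` is a placeholder for the pre-trained summarization model. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption: '.', '!' or '?' ends a sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences: List[str], threshold: float = 0.1) -> List[List[str]]:
    """Stand-in for graph clustering: treat the similarity between neighbouring
    sentences as an edge weight and start a new chunk where it drops below
    `threshold`."""
    if not sentences:
        return []
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(bags[i - 1], bags[i]) < threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return chunks


def first_sentence(chunk: List[str]) -> str:
    """Placeholder compressor: keeps only the first sentence of a chunk.
    A real system would call a pre-trained summarization model here."""
    return chunk[0]


def semantic_compress(
    text: str,
    compress_fn: Callable[[List[str]], str] = first_sentence,
    threshold: float = 0.1,
) -> str:
    """Segment, compress each chunk independently, and merge in original order."""
    chunks = segment(split_sentences(text), threshold)
    return " ".join(compress_fn(chunk) for chunk in chunks)


example = (
    "Cats chase mice. Cats chase birds. "
    "Stocks fell sharply today. Stocks fell again."
)
compressed = semantic_compress(example)
```

Passing the compressor in as `compress_fn` keeps the segmentation and merging logic independent of whichever summarization model is used, which mirrors the plug-in nature of the framework described above.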
### Experimental Results
1. **Experimental Setup**:
- Experiments were conducted on multiple tasks, including single-document question answering, multi-document question answering, summarization, few-shot learning, and information retrieval.
- The 7B-parameter LLaMA model was used as the baseline, with an input context window of 4096 tokens.
2. **Performance Evaluation**:
- **Fluency**: Perplexity was calculated to evaluate the fluency of the generated text. With semantic compression, perplexity remained low on long sequences, indicating that generation quality did not degrade.
- **Long Text Processing Capability**: In the passkey retrieval task, the baseline model's accuracy quickly dropped to zero once the input length exceeded 4096 tokens. With semantic compression, accuracy remained above 90% even when the input length reached 30,000 tokens.
- **Multi-task Performance**: Across various NLP tasks, when the input length was in the 4k-16k range, the semantic compression method outperformed other methods in most tasks; in the 32k+ range, other methods failed due to insufficient memory, while the semantic compression method still maintained over 70% performance.
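
For reference, the perplexity used in the fluency evaluation is the exponential of the average negative log-likelihood per token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a model; the function name is hypothetical and no particular library's API is implied.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower values mean the model finds the text more predictable, which is why a perplexity that stays low on long inputs is read as a sign of preserved fluency.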
### Conclusion
The authors propose a semantic compression method that significantly extends the context window of LLMs, enabling them to handle text inputs 6-8 times longer. The method is computationally efficient and easy to implement, and it can be combined with existing interpolation methods or applied in front of black-box APIs. This offers a practical option for long-text processing and lowers the cost of running LLMs on long inputs.