LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
2024-08-12
Abstract: In long context scenarios, large language models (LLMs) face three main challenges: higher computational cost, performance reduction, and position bias. Research indicates that LLM performance hinges on the density and position of key information in the input prompt. Inspired by these findings, we propose LongLLMLingua for prompt compression towards improving LLMs' perception of the key information to simultaneously address the three challenges. Our extensive evaluation across various long context scenarios demonstrates that LongLLMLingua not only enhances performance but also significantly reduces costs and latency. For instance, in the NaturalQuestions benchmark, LongLLMLingua boosts performance by up to 21.4% with around 4x fewer tokens in GPT-3.5-Turbo, leading to substantial cost savings. It achieves a 94.0% cost reduction in the LooGLE benchmark. Moreover, when compressing prompts of about 10k tokens at ratios of 2x-6x, LongLLMLingua can accelerate end-to-end latency by 1.4x-2.6x. Our code is available at https://aka.ms/LongLLMLingua.
Computation and Language, Machine Learning
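The abstract reports compression as ratios (around 4x, and 2x-6x on prompts of about 10k tokens). As a quick worked check of what such ratios mean for prompt length, the minimal sketch below converts a ratio into an approximate compressed token count and the fraction of input tokens removed. The helper names are hypothetical, and the calculation counts input tokens only; it is not how the paper computes cost reduction or latency.

```python
# Hypothetical helpers: prompt length and input-token savings for a given
# compression ratio (original tokens / compressed tokens). Output tokens and
# pricing are ignored, so this is NOT the paper's cost or latency calculation.


def compressed_tokens(original_tokens: int, ratio: float) -> int:
    """Approximate prompt length after compressing at `ratio`x."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return round(original_tokens / ratio)


def input_token_savings(ratio: float) -> float:
    """Fraction of input tokens removed at `ratio`x compression."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return 1 - 1 / ratio


if __name__ == "__main__":
    for r in (2, 4, 6):
        print(f"{r}x: {compressed_tokens(10_000, r)} tokens, "
              f"{input_token_savings(r):.1%} fewer input tokens")
```

For a 10,000-token prompt, 2x leaves about 5,000 tokens (50% fewer), 4x about 2,500 (75% fewer), and 6x about 1,667 (roughly 83% fewer).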
What problem does this paper attempt to address?
This paper addresses three main challenges that large language models (LLMs) face in long-context scenarios:

1. **Higher computational cost**: both financial cost and latency increase, because processing long inputs requires more computational resources.
2. **Performance degradation**: irrelevant and redundant information in long inputs can degrade LLM performance.
3. **Position bias**: the position of key information significantly affects model performance; information in the middle of the input is especially likely to be overlooked.

To meet these challenges, the authors propose **LongLLMLingua**, which uses prompt compression to improve the LLM's perception of key information and thereby addresses all three problems at once. Its main contributions are:

1. **A question-aware coarse-to-fine compression method** that increases the density of key information in the prompt.
2. **A document reordering strategy** that mitigates the position bias of LLMs.
3. **Dynamic compression ratios** that give finer control over the balance between coarse-grained and fine-grained compression.
4. **A post-compression subsequence recovery strategy** that improves the integrity of key information.
5. **Extensive evaluation on multiple benchmarks**, including NaturalQuestions, LongBench, ZeroSCROLLS, MuSiQue, and LooGLE, covering a range of long-context scenarios. The results show that prompts compressed by LongLLMLingua outperform the original prompts in performance, cost efficiency, and system latency.

Through these methods, LongLLMLingua both improves LLM performance in long-context scenarios and significantly reduces computational cost and latency.
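The four techniques listed above can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the LongLLMLingua implementation: document relevance scores and per-token importance scores are taken as given (hypothetical inputs; how the paper actually computes them is not reproduced here), the linear schedule in `keep_ratios` is an illustrative stand-in rather than the paper's dynamic compression formula, and `recover` is a simplified form of subsequence recovery that maps an exact run of compressed tokens back to the original span it was cut from. All names (`Doc`, `select_and_reorder`, `keep_ratios`, `compress_doc`, `recover`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    tokens: list[str]
    relevance: float           # hypothetical question-aware relevance score
    token_scores: list[float]  # hypothetical per-token importance scores


def select_and_reorder(docs: list[Doc], top_k: int) -> list[Doc]:
    """Coarse-grained step plus reordering: keep the top_k most relevant
    documents and place the most relevant one first."""
    ranked = sorted(docs, key=lambda d: d.relevance, reverse=True)
    return ranked[:top_k]


def keep_ratios(n_docs: int, base: float, spread: float) -> list[float]:
    """Illustrative dynamic schedule (an assumption, not the paper's formula):
    the fraction of tokens kept falls linearly from base + spread/2 (rank 0)
    to base - spread/2 (last rank), clipped to [0, 1]."""
    if n_docs == 1:
        return [min(max(base, 0.0), 1.0)]
    ratios = []
    for rank in range(n_docs):
        r = base + spread * (0.5 - rank / (n_docs - 1))
        ratios.append(min(max(r, 0.0), 1.0))
    return ratios


def compress_doc(doc: Doc, keep: float) -> list[int]:
    """Fine-grained step: indices of the highest-scoring tokens, returned in
    their original order so the compressed text stays readable."""
    k = round(len(doc.tokens) * keep)
    order = sorted(range(len(doc.tokens)),
                   key=lambda i: doc.token_scores[i], reverse=True)
    return sorted(order[:k])


def recover(original: list[str], kept: list[int],
            response: list[str]) -> list[str]:
    """Simplified subsequence recovery: if the response span matches an exact
    run of compressed tokens, return the full original span it came from."""
    if not response:
        return response
    compressed = [original[i] for i in kept]
    n = len(response)
    for s in range(len(compressed) - n + 1):
        if compressed[s:s + n] == response:
            return original[kept[s]:kept[s + n - 1] + 1]
    return response


if __name__ == "__main__":
    docs = [
        Doc("it rained all day".split(), 0.1, [0.2, 0.9, 0.1, 0.6]),
        Doc("Paris is the capital of France".split(), 0.9,
            [0.9, 0.3, 0.1, 0.8, 0.2, 0.9]),
        Doc("the cat sat on the mat".split(), 0.4,
            [0.1, 0.9, 0.5, 0.2, 0.1, 0.8]),
    ]
    ranked = select_and_reorder(docs, top_k=2)
    ratios = keep_ratios(len(ranked), base=0.5, spread=0.4)
    for doc, keep in zip(ranked, ratios):
        idx = compress_doc(doc, keep)
        print([doc.tokens[i] for i in idx])
```

Running the script drops the least relevant document in the coarse step, keeps 70% of the top-ranked document (`['Paris', 'is', 'capital', 'France']`) and 30% of the second (`['cat', 'mat']`). Calling `recover` with the first document's tokens, kept indices `[0, 1, 3, 5]`, and the response `['capital', 'France']` restores `['capital', 'of', 'France']`. Sorting the kept indices back into original order is a deliberate choice so that compressed text preserves word order.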