Abstract: Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We only found marginal improvements on summarization tasks.

What problem does this paper attempt to address?

The paper addresses two challenges in long-context inference: the increased compute and memory demands of processing long prompts, and the decline in a model's ability to reason accurately over long sequences. Rather than proposing a new method, the study carries out a comprehensive evaluation and comparison of existing prompt compression methods. Its core objectives can be summarized as follows:

1. **Standardized Analysis**: Prior work on prompt compression has not compared methods under a common standard, which has led to conflicting results. The paper therefore compares different prompt compression techniques through a standardized analysis.
2. **Comprehensive Evaluation**: The paper evaluates three families of methods: extractive compression, summarization-based abstractive compression, and token pruning.
3. **Performance Comparison**: The study finds that extractive compression often outperforms the other approaches, enabling up to 10x compression with minimal accuracy degradation. Although several recent studies claim that token pruning performs well, the paper's experiments indicate that token pruning methods often lag behind extractive compression.
4. **Application Scenarios**: The paper examines how effective these compression methods are on single-document question answering, multi-document question answering, and summarization tasks.

Through this work, the paper provides a reference for selecting an appropriate prompt compression method in practical applications. It also discusses related work, including long-context language models, retrieval-augmented generation, and the classification of existing prompt compression methods.
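
To make the idea of extractive compression concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simple word-overlap relevance score, a regex sentence splitter, and a word count as a stand-in for a token count; the function names (`extractive_compress`, `overlap_score`, `compression_ratio`) are hypothetical and chosen only for illustration. A real system would likely use a trained retriever or reranker as the scorer and the target model's tokenizer for the budget.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text: str) -> list[str]:
    """Lowercased alphanumeric words; a crude proxy for model tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, sentence: str) -> float:
    """Fraction of the sentence's words that also occur in the query.

    Assumption: this lexical scorer is only a placeholder for a learned
    relevance model.
    """
    query_words = set(tokenize(query))
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(1 for w in words if w in query_words) / len(words)


def extractive_compress(context: str, query: str, token_budget: int) -> str:
    """Keep the highest-scoring sentences verbatim until the budget is used."""
    sentences = split_sentences(context)
    # Rank sentence indices by relevance, most relevant first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: overlap_score(query, sentences[i]),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        cost = len(tokenize(sentences[i]))
        if used + cost <= token_budget:
            kept.append(i)
            used += cost
    # Restore the original order so the compressed prompt stays readable.
    return " ".join(sentences[i] for i in sorted(kept))


def compression_ratio(original: str, compressed: str) -> float:
    """Original length divided by compressed length, in proxy tokens."""
    return len(tokenize(original)) / max(1, len(tokenize(compressed)))


if __name__ == "__main__":
    context = (
        "The Eiffel Tower is in Paris. Bananas are yellow. "
        "The tower was completed in 1889. Cats sleep a lot."
    )
    query = "When was the Eiffel Tower completed?"
    compressed = extractive_compress(context, query, token_budget=6)
    print(compressed)
    print(round(compression_ratio(context, compressed), 2))
```

Because the kept sentences are copied unchanged, the compressed prompt contains only text from the source. Token pruning, by contrast, deletes individual tokens and can leave ungrammatical fragments, while abstractive compression rewrites the content with a summarizer.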

Characterizing Prompt Compression Methods for Long Context Inference

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

Perception Compressor: A training-free prompt compression method in long context scenarios

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Prompt-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression

Prompt Compression for Large Language Models: A Survey

500xCompressor: Generalized Prompt Compression for Large Language Models

From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Learning to Compress Prompt in Natural Language Formats

Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

Discrete Prompt Compression With Reinforcement Learning

TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning

Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

Parse Trees Guided LLM Prompt Compression

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Learning to Compress Prompts with Gist Tokens

LanguaShrink: Reducing Token Overhead with Psycholinguistics