Characterizing Prompt Compression Methods for Long Context Inference

Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami
2024-07-12
Abstract: Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We only found marginal improvements on summarization tasks.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses two challenges in long context inference: the increased compute and memory required to process long prompts, and the difficulty of reasoning accurately over long contexts. Rather than proposing a new method, it carries out a standardized evaluation and comparison of existing prompt compression methods. Its core objectives can be summarized as follows:

1. **Standardized Analysis**: Prior work on prompt compression has not compared methods under a common setup, which has led to conflicting results. The paper compares the techniques through a single standardized analysis.
2. **Comprehensive Evaluation**: The paper evaluates three families of methods: extractive compression, summarization-based abstractive compression, and token pruning.
3. **Performance Comparison**: Extractive compression often outperforms the other approaches, enabling up to 10x compression with minimal accuracy degradation. Despite several recent claims, token pruning methods often lag behind extractive compression; the paper reports only marginal improvements on summarization tasks.
4. **Application Scenarios**: The methods are evaluated on single-document question answering, multi-document question answering, and text summarization.

Through this work, the paper provides a reference for selecting a prompt compression method in practical applications. It also discusses related work, including long-context language models, retrieval-augmented generation, and the classification of existing prompt compression methods.
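To make the idea of extractive compression concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names, the sentence splitter, and the word-overlap relevance score are all assumptions chosen for illustration, and a real system would more likely use a learned retriever or reranker to score chunks. The sketch keeps whole sentences from the context, ranked by relevance to the query, until a token budget is exhausted.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased alphanumeric "tokens"; a stand-in for a real model tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def extractive_compress(context, query, max_tokens):
    """Keep the sentences most relevant to the query within a token budget.

    Relevance here is plain word overlap with the query, a hypothetical
    stand-in for whatever scoring model a real extractive compressor uses.
    """
    query_terms = set(tokenize(query))
    sentences = split_sentences(context)

    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        overlap = len(query_terms & set(tokens))
        scored.append((overlap, idx, len(tokens)))

    # Highest overlap first; earlier sentences win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))

    kept, used = [], 0
    for _, idx, length in scored:
        if used + length <= max_tokens:
            kept.append(idx)
            used += length

    # Restore original order so the compressed prompt stays readable.
    return " ".join(sentences[i] for i in sorted(kept))


if __name__ == "__main__":
    context = (
        "The cat sat on the mat. "
        "Paris is the capital of France. "
        "Dogs bark loudly at night."
    )
    print(extractive_compress(context, "What is the capital of France?", 6))
```

Because whole sentences are kept or dropped, the compressed prompt remains grammatical text. Token pruning, by contrast, removes individual tokens from within sentences, and abstractive compression rewrites the context as a summary.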