Abstract: As large language models (LLMs) improve their capabilities in handling complex tasks, the computational cost and efficiency problems caused by long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose an innovative prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompts, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression. This effectively reduces prompt length while preserving essential information. The framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training, following the training method of OpenChat \cite{wang2023openchat}. Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the computational cost and efficiency issues faced by large language models (LLMs) when processing long-text prompts. Specifically:

1. **Computational cost and efficiency issues**: As the capabilities of large language models continue to improve, the length of prompts required to handle complex tasks also keeps increasing, which leads to significant computational costs and time expenditures. Especially when processing long texts, these models require more memory and computational resources, increasing the economic burden and technical challenges.
2. **Redundant information removal**: Although some existing prompt compression methods can reduce the prompt length, they may lose critical information when removing redundant information, affecting the performance and accuracy of the model.
3. **Lack of universality and portability**: Current task-specific prompt compression methods lack universality and portability and cannot adapt well to a variety of different application scenarios.

To solve these problems, the paper proposes an innovative prompt compression framework, LanguaShrink. The main goal of this framework is to accelerate model inference and reduce costs by compressing the prompt length while retaining the core information to ensure that the model performance is not affected.

### Main features of LanguaShrink

- **Psycholinguistics-based compression**: LanguaShrink utilizes psycholinguistic principles and the Ebbinghaus forgetting curve to achieve task-independent prompt compression, ensuring that the compressed prompt still contains important information.
- **Part-of-speech-priority compression**: By designing specific prompts, the model is guided to prioritize and compress the text according to parts of speech, thus more efficiently retaining the core information.
- **Data distillation technology**: Use a small-scale model to learn the compression target and combine the KL-regularized reinforcement learning strategy for training to optimize the compression effect.
- **Block compression algorithm**: Divide the text into multiple blocks, evaluate the relevance, importance, and perplexity of each block, and adjust the retention priority to achieve an adjustable compression rate.

Experimental results show that LanguaShrink achieves a compression rate of up to 26 times on multiple datasets while maintaining performance comparable to the original prompt, significantly improving semantic similarity and reducing end-to-end latency.
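The block compression idea described above (split the text into blocks, score each block, keep the highest-priority ones up to a budget) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the weights `w_rel` and `w_sur`, and the `keep_ratio` parameter are all hypothetical. Relevance is approximated by word overlap with a query, perplexity is approximated by mean unigram surprisal computed from the document itself rather than from a language model, and the separate importance score is omitted for brevity.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, sentences_per_chunk=1):
    """Split text on sentence-ending punctuation and group sentences into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def tokenize(text):
    """Lowercase alphanumeric word tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(chunk, query):
    """Fraction of query words found in the chunk (assumed relevance proxy)."""
    query_words = set(tokenize(query))
    if not query_words:
        return 0.0
    return len(query_words & set(tokenize(chunk))) / len(query_words)


def surprisal(chunk, counts, total):
    """Mean unigram surprisal in nats (assumed stand-in for LM perplexity)."""
    words = tokenize(chunk)
    if not words:
        return 0.0
    return sum(-math.log(counts[w] / total) for w in words) / len(words)


def compress(text, query, keep_ratio=0.5, sentences_per_chunk=1,
             w_rel=0.7, w_sur=0.3):
    """Keep the highest-scoring chunks within a word budget, in original order."""
    chunks = split_into_chunks(text, sentences_per_chunk)
    if not chunks:
        return ""
    words = tokenize(text)
    counts, total = Counter(words), len(words)
    surps = [surprisal(c, counts, total) for c in chunks]
    max_surp = max(surps) or 1.0  # avoid division by zero
    scores = [
        w_rel * relevance(c, query) + w_sur * s / max_surp
        for c, s in zip(chunks, surps)
    ]
    budget = keep_ratio * total  # adjustable compression rate
    kept, used = set(), 0
    for i in sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True):
        cost = len(tokenize(chunks[i]))
        # Always keep the top chunk; skip later chunks that exceed the budget.
        if kept and used + cost > budget:
            continue
        kept.add(i)
        used += cost
    return " ".join(chunks[i] for i in sorted(kept))


if __name__ == "__main__":
    doc = ("The cat sat on the mat. Prompt compression reduces token cost. "
           "The weather is nice today. Birds sing in the morning.")
    print(compress(doc, "prompt compression cost", keep_ratio=0.3))
```

Lowering `keep_ratio` tightens the word budget and so raises the compression rate, which mirrors the adjustable-rate behavior the summary attributes to the block algorithm. A faithful version would replace the two proxy scores with model-based relevance, importance, and perplexity estimates.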

LanguaShrink: Reducing Token Overhead with Psycholinguistics

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Prompt Compression for Large Language Models: A Survey

500xCompressor: Generalized Prompt Compression for Large Language Models

Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

Perception Compressor: A training-free prompt compression method in long context scenarios

TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Learning to Compress Prompt in Natural Language Formats

Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

Discrete Prompt Compression With Reinforcement Learning

Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression

Network-aided Efficient Large Language Model Services With Denoising-inspired Prompt Compression

SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself

Parse Trees Guided LLM Prompt Compression

Efficient Prompting Methods for Large Language Models: A Survey

Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression