LanguaShrink: Reducing Token Overhead with Psycholinguistics

Xuechen Liang, Meiling Tao, Yinghui Xia, Tianyu Shi, Jun Wang, JingSong Yang
2024-09-02
Abstract: As large language models (LLMs) improve their capabilities in handling complex tasks, the issues of computational cost and efficiency due to long prompts are becoming increasingly prominent. To accelerate model inference and reduce costs, we propose an innovative prompt compression framework called LanguaShrink. Inspired by the observation that LLM performance depends on the density and position of key information in the input prompts, LanguaShrink leverages psycholinguistic principles and the Ebbinghaus memory curve to achieve task-agnostic prompt compression. This effectively reduces prompt length while preserving essential information. We refer to the training method of OpenChat \cite{wang2023openchat}. The framework introduces part-of-speech priority compression and data distillation techniques, using smaller models to learn compression targets and employing a KL-regularized reinforcement learning strategy for training. Additionally, we adopt a chunk-based compression algorithm to achieve adjustable compression rates. We evaluate our method on multiple datasets, including LongBench, ZeroScrolls, Arxiv Articles, and a newly constructed novel test set. Experimental results show that LanguaShrink maintains semantic similarity while achieving up to 26 times compression. Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the computational cost and efficiency problems that large language models (LLMs) face when processing long-text prompts. Specifically:

1. **Computational cost and efficiency**: As LLM capabilities improve, the prompts needed for complex tasks keep growing, which raises computational cost and inference time. Long texts in particular require more memory and compute, increasing both the economic burden and the technical difficulty.
2. **Redundant information removal**: Existing prompt compression methods can shorten a prompt, but they may discard critical information along with the redundant content, which hurts model performance and accuracy.
3. **Lack of universality and portability**: Current task-specific prompt compression methods do not transfer well across different application scenarios.

To solve these problems, the paper proposes a prompt compression framework, LanguaShrink. Its main goal is to accelerate model inference and reduce cost by shortening the prompt while retaining the core information, so that model performance is not affected.

### Main features of LanguaShrink

- **Psycholinguistics-based compression**: LanguaShrink uses psycholinguistic principles and the Ebbinghaus forgetting curve to achieve task-independent prompt compression, so that the compressed prompt still contains the important information.
- **Part-of-speech-priority compression**: Specially designed prompts guide the model to prioritize and compress text according to parts of speech, which retains the core information more efficiently.
- **Data distillation**: A small model learns the compression target and is trained with a KL-regularized reinforcement learning strategy to optimize the compression quality.
- **Chunk-based compression algorithm**: The text is divided into multiple chunks; each chunk is evaluated for relevance, importance, and perplexity, and its retention priority is adjusted accordingly to achieve an adjustable compression rate.

Experimental results show that LanguaShrink achieves up to 26 times compression across multiple datasets while keeping performance comparable to the original prompt, preserving semantic similarity and improving end-to-end latency by 1.43 times over existing prompt compression methods.
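
The chunk (block) compression step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring functions, the weights, the stopword list, and the names `split_chunks`, `score_chunk`, and `compress` are all hypothetical. In particular, relevance is approximated by word overlap with a query, importance by the share of content words, and perplexity by mean surprisal under a document-level unigram model, where the paper would use model-based estimates.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "on", "for", "was", "with"}


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real model tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_chunks(text, sentences_per_chunk=1):
    """Split text into chunks of a fixed number of sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]


def score_chunk(chunk, query_tokens, unigram, total):
    """Combine relevance, importance, and a perplexity proxy into one score."""
    tokens = tokenize(chunk)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in STOPWORDS]
    # Relevance (assumption): fraction of query words that appear in the chunk.
    relevance = len(set(content) & query_tokens) / max(len(query_tokens), 1)
    # Importance (assumption): share of content words in the chunk.
    importance = len(content) / len(tokens)
    # Perplexity proxy (assumption): mean unigram surprisal, scaled to roughly [0, 1].
    surprisal = sum(-math.log(unigram[t] / total) for t in tokens) / len(tokens)
    surprisal /= math.log(total + 1)
    # Weights are arbitrary illustrative values, not taken from the paper.
    return 0.5 * relevance + 0.3 * importance + 0.2 * surprisal


def compress(text, query, ratio=2.0, sentences_per_chunk=1):
    """Keep the highest-priority chunks within a token budget of total / ratio."""
    chunks = split_chunks(text, sentences_per_chunk)
    all_tokens = tokenize(text)
    unigram, total = Counter(all_tokens), len(all_tokens)
    query_tokens = set(tokenize(query)) - STOPWORDS
    budget = total / ratio
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: score_chunk(chunks[i], query_tokens, unigram, total),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        n = len(tokenize(chunks[i]))
        # The top-ranked chunk is always kept; later ones must fit the budget.
        if kept and used + n > budget:
            continue
        kept.add(i)
        used += n
    # Restore the original order so the compressed prompt stays readable.
    return " ".join(chunks[i] for i in sorted(kept))


if __name__ == "__main__":
    doc = ("The cat sat on the mat. Alice stored the launch code in the blue vault. "
           "It was a sunny day. The weather was nice and warm.")
    print(compress(doc, "Where is the launch code?", ratio=2.0))
```

The `ratio` parameter plays the role of the adjustable compression rate: a larger value shrinks the token budget, so fewer chunks survive. Chunks are selected by priority but emitted in their original order, which keeps the position of the retained information stable.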