Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu
2024-08-28
Abstract: Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate hallucinations, LLMs are often paired with a retrieval-augmented pipeline that supplies rich external knowledge and context. This introduces its own challenges, because the context returned by the retriever can be inaccurate and coarse-grained. Supplying irrelevant context to an LLM can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the issues encountered by large language models (LLMs) when processing long texts, particularly the challenges related to retrieval-augmented generation (RAG). Specifically:

1. **Relevance Issue**: Traditional RAG methods may retrieve irrelevant documents or irrelevant information within documents, leading to a decline in model output quality, increased inference latency, and higher costs.
2. **Context Compression**: The paper proposes a method called "instruction-aware context compression," which accelerates and enhances the performance of LLMs by filtering out irrelevant content. This method can significantly reduce memory consumption while maintaining performance and lowering inference latency.

Experimental results show that this method can substantially reduce context-related costs and improve inference speed while retaining most of the performance. Specifically:

- Context-related costs were reduced by 50%.
- Inference memory usage was reduced by 5%.
- Inference speed was increased by 2.2 times.
- Rouge-1 score only decreased by 0.047.

These findings indicate that the method achieves an effective balance between efficiency and performance.
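
To make the idea of filtering context against an instruction concrete, here is a minimal sketch. It is not the paper's method: the summary above does not describe how the compressor scores content, so this example substitutes a simple word-overlap score as a stand-in. The function names (`split_sentences`, `tokenize`, `compress_context`) and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens as a set (assumed stand-in for a real scorer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def compress_context(instruction, context, keep_ratio=0.5):
    """Keep the sentences that share the most words with the instruction.

    keep_ratio=0.5 mirrors the 50% context reduction quoted above, but the
    overlap score is an illustrative assumption, not the paper's compressor.
    """
    sentences = split_sentences(context)
    if not sentences:
        return ""
    query = tokenize(instruction)
    # Score each sentence by the number of tokens it shares with the instruction.
    scores = [len(query & tokenize(s)) for s in sentences]
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    # Rank by descending score; the sort is stable, so ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    kept = sorted(ranked[:k])  # restore original order for readability
    return " ".join(sentences[i] for i in kept)


if __name__ == "__main__":
    instruction = "When was the Eiffel Tower completed?"
    context = (
        "The Eiffel Tower was completed in 1889. Paris has many cafes. "
        "The tower is made of iron. Tourists enjoy croissants."
    )
    print(compress_context(instruction, context))
```

In a RAG pipeline, a step like this would sit between the retriever and the LLM, so that the prompt carries only the retained sentences. A learned, instruction-conditioned scorer would replace the overlap heuristic in any realistic setting.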