NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
2024-08-08
Abstract: Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling. Despite several works proposing to evict unnecessary tokens from the KV Cache, most of them rely on the biased local statistics of accumulated attention scores and report performance using unconvincing metrics such as perplexity on inadequate short-text evaluation. In this paper, we propose NACL, a general framework for long-context KV cache eviction that achieves more optimal and efficient eviction in a single operation during the encoding phase. Due to NACL's efficiency, we combine more accurate attention score statistics in PROXY TOKENS EVICTION with the diversified random eviction strategy of RANDOM EVICTION, aiming to alleviate the issue of attention bias and enhance the robustness in maintaining pivotal tokens for long-context modeling tasks. Notably, our method significantly improves the performance on short- and long-text tasks by 80% and 76% respectively, reducing KV Cache by up to 50% with over 95% performance maintenance. The code is available at <a class="link-external link-https" href="https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL" rel="external noopener nofollow">this https URL</a>.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the excessive memory consumption of the KV cache when large language models (LLMs) perform long-context modeling. When an LLM processes long inputs, storing the KV cache takes so much memory that deploying the model on fixed-memory hardware becomes costly or infeasible. For example, a 7-billion-parameter model generates 64GB of KV cache at an input batch size of 4 and a sequence length of 32k, which is 4.7 times larger than the model weights themselves.
To alleviate this problem, existing research has mainly exploited sparsity in Transformer attention blocks to remove unnecessary tokens from the KV cache. However, most of these methods rely on local statistics of accumulated attention scores, which can be biased in long-context tasks, and they report results with metrics such as perplexity on short texts, which makes their evaluation unconvincing.
The paper therefore proposes NACL (Not Another Cache LLM), a general and effective KV cache eviction framework that performs a more optimal and efficient eviction in a single operation during the encoding phase. NACL combines a proxy-token-based eviction strategy (PROXY TOKENS EVICTION) with a random eviction strategy (RANDOM EVICTION) to mitigate attention bias and to make the retention of key tokens more robust in long-context modeling tasks. The main contributions of the paper include:
1. **Proposing a new KV cache eviction framework**: NACL completes eviction in one operation during the encoding phase instead of evicting gradually during generation, which improves efficiency.
2. **Combining multiple eviction strategies**: By combining the proxy-token-based strategy with the random strategy, NACL identifies and retains key tokens more accurately while reducing the performance degradation caused by attention bias.
3. **Significant performance improvement**: Experimental results show that NACL improves performance on short-text and long-text tasks by 80% and 76% respectively, while reducing the KV cache by up to 5 times and maintaining more than 95% of the performance.
In conclusion, NACL aims to let large language models handle long-context tasks under a limited memory budget through efficient KV cache management.
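The 64GB figure and the combined eviction idea can be illustrated with a short Python sketch. It is not the authors' implementation. The model shape (32 layers, hidden size 4096, 16-bit values) is an assumed LLaMA-2-7B-like configuration, and `kv_cache_bytes`, `select_kept_tokens`, `n_proxy`, `budget`, and `random_frac` are hypothetical names introduced here. The sketch also assumes that the proxy tokens are the last positions of the prompt and that the random part samples tokens in proportion to their proxy-attention score; the paper's exact scoring and sampling rules may differ.

```python
import numpy as np


def kv_cache_bytes(batch, seq_len, n_layers, hidden_size, bytes_per_value=2):
    """Size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * batch * seq_len * hidden_size * bytes_per_value


def select_kept_tokens(attn, n_proxy, budget, random_frac, rng):
    """Pick which token positions stay in the cache for one attention head.

    attn        : (seq_len, seq_len) attention weights, rows are queries.
    n_proxy     : the last n_proxy positions act as proxy tokens (assumption).
    budget      : total number of positions to keep, proxy tokens included.
    random_frac : share of the non-proxy budget filled by random sampling.
    """
    seq_len = attn.shape[0]
    n_ctx = seq_len - n_proxy
    if not n_proxy <= budget <= seq_len:
        raise ValueError("budget must lie between n_proxy and seq_len")

    # Score each context token by the attention it receives from proxy queries.
    scores = attn[n_ctx:, :n_ctx].sum(axis=0)

    n_slots = budget - n_proxy
    n_random = int(round(random_frac * n_slots))
    n_top = n_slots - n_random

    # Deterministic part: keep the highest-scoring context tokens.
    order = np.argsort(-scores, kind="stable")
    top, rest = order[:n_top], order[n_top:]

    # Random part: sample the remaining slots, weighted by score.
    if n_random > 0:
        p = scores[rest] + 1e-12  # avoid an all-zero distribution
        p = p / p.sum()
        sampled = rng.choice(rest, size=n_random, replace=False, p=p)
    else:
        sampled = np.empty(0, dtype=order.dtype)

    proxy = np.arange(n_ctx, seq_len)  # proxy tokens are always kept
    return np.sort(np.concatenate([top, sampled, proxy]))


# Batch 4, 32k tokens, assumed 32 layers x 4096 hidden size, fp16 values.
print(kv_cache_bytes(4, 32 * 1024, 32, 4096) / 2**30)  # 64.0
```

In a real model the selection would run once per layer and head after the prompt has been encoded, and the returned indices would be used to gather the rows of the key and value tensors that remain in the cache.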