Abstract: Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling. Despite several works proposing to evict unnecessary tokens from the KV Cache, most of them rely on the biased local statistics of accumulated attention scores and report performance using unconvincing metrics such as perplexity on inadequate short-text evaluation. In this paper, we propose NACL, a general framework for long-context KV cache eviction that achieves more optimal and efficient eviction in a single operation during the encoding phase. Due to NACL's efficiency, we combine more accurate attention score statistics in PROXY TOKENS EVICTION with the diversified random eviction strategy of RANDOM EVICTION, aiming to alleviate the issue of attention bias and enhance the robustness in maintaining pivotal tokens for long-context modeling tasks. Notably, our method significantly improves the performance on short- and long-text tasks by 80% and 76% respectively, reducing KV Cache by up to 50% with over 95% performance maintenance. The code is available at https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL.
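
To get a feel for the memory cost the abstract refers to, the following back-of-the-envelope sketch estimates KV cache size for a 7B-class decoder. The layer count (32), hidden size (4096), 16-bit storage, and plain multi-head attention are assumptions typical of LLaMA-style 7B models, not values stated on this page. The function name `kv_cache_bytes` is hypothetical. The batch size and sequence length match the example quoted in the summary further down.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers=32, hidden_size=4096,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in the batch.

    Defaults are ASSUMED (LLaMA-style 7B, fp16, no grouped-query attention).
    """
    # Each layer stores one key vector and one value vector per token.
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return batch_size * seq_len * per_token


total = kv_cache_bytes(batch_size=4, seq_len=32 * 1024)
print(f"{total / 2**30:.0f} GiB")  # prints "64 GiB"
```

Under these assumptions the cache grows linearly with both batch size and sequence length, which is why eviction becomes attractive at long context. Models that use grouped-query attention or lower-precision caches would need proportionally less.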

What problem does this paper attempt to address?

This paper addresses the excessive memory consumption of the KV cache when large language models (LLMs) perform long-context modeling. When an LLM processes long inputs, storing the KV cache requires so much memory that deploying the model on fixed-memory hardware becomes costly or infeasible. For example, a 7-billion-parameter model generates 64GB of KV cache at an input batch size of 4 and a sequence length of 32k, which is 4.7 times larger than the model weights themselves.

To alleviate this problem, existing research has mainly exploited sparsity in Transformer attention blocks to remove unnecessary tokens from the KV cache. However, most of these methods rely on local statistics of accumulated attention scores, which can be biased in long-context tasks, and they report results using unconvincing metrics such as perplexity on short texts. The paper therefore proposes NACL (Not Another Cache LLM), a general and effective KV cache eviction framework that aims for more efficient and more optimal eviction through a one-time global eviction operation in the encoding stage. NACL combines a proxy-token-based eviction strategy (PROXY TOKENS EVICTION) with a random eviction strategy (RANDOM EVICTION) to mitigate attention bias and make the retention of key tokens more robust in long-context modeling tasks.

The main contributions of the paper are:

1. **Proposing a new KV cache eviction framework**: NACL completes eviction in a single global operation during the encoding stage instead of evicting gradually during the generation stage, which improves efficiency.
2. **Combining multiple eviction strategies**: By combining the proxy-token-based eviction strategy with the random eviction strategy, NACL can identify and retain key tokens more accurately while reducing the performance degradation caused by attention bias.
3. **Significant performance improvement**: Experimental results show that NACL improves performance on short-text and long-text tasks by 80% and 76% respectively, while reducing the KV cache by up to 5 times and maintaining more than 95% of the performance.

In conclusion, NACL aims to enable large language models to better handle long-context tasks under a limited memory budget through efficient KV cache management.
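
The combination of the two strategies can be illustrated with a toy, single-head sketch. This is not the authors' implementation. The function name `select_tokens_to_keep`, the `random_frac` split, the uniform sampling of the random portion, and the nested-list attention matrix are all assumptions made for illustration. The idea shown is only the general shape: score prompt tokens by the attention they receive from a few proxy tokens, keep the top scorers, and fill the rest of the budget at random, all in one pass over the encoded prompt.

```python
import random


def select_tokens_to_keep(attn, proxy_idx, budget, random_frac=0.3, seed=0):
    """Return sorted indices of prompt tokens whose KV entries are retained.

    attn[q][k] is the attention weight from query token q to key token k.
    All parameter names here are hypothetical.
    """
    n = len(attn)
    if budget >= n:
        return list(range(n))
    # Score each key token by the attention it receives from proxy tokens only.
    scores = [sum(attn[q][k] for q in proxy_idx) for k in range(n)]
    n_random = int(budget * random_frac)
    n_top = budget - n_random
    # Proxy-token part: keep the highest-scoring tokens.
    ranked = sorted(range(n), key=lambda k: scores[k], reverse=True)
    kept = ranked[:n_top]
    # Random part (ASSUMED uniform): fill the remaining budget from the rest.
    rng = random.Random(seed)
    kept += rng.sample(ranked[n_top:], n_random)
    return sorted(kept)


# Toy causal attention over 6 prompt tokens; the last token acts as the proxy.
n = 6
attn = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]
attn[5] = [0.4, 0.05, 0.05, 0.3, 0.1, 0.1]
kept = select_tokens_to_keep(attn, proxy_idx=[5], budget=4, random_frac=0.5)
print(kept)
```

With a budget of 4 and half of it reserved for random picks, tokens 0 and 3 are always kept because the proxy token attends to them most, and the other two slots vary with the seed. A real system would apply such a selection per layer and per head on actual attention scores.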

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Efficient LLM Inference with Kcache

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

In-context KV-Cache Eviction for LLMs via Attention-Gate

ThinK: Thinner Key Cache by Query-Driven Pruning

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

CORM: Cache Optimization with Recent Message for Large Language Model Inference

Lossless KV Cache Compression to 2%

Unifying KV Cache Compression for Large Language Models with LeanKV

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Anchor Attention, Small Cache: Code Generation with Large Language Models

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

QAQ: Quality Adaptive Quantization for LLM KV Cache