Abstract: The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy. Moreover, because it does not rely on attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.

What problem does this paper attempt to address?

This paper addresses the high decoding latency and hardware limitations caused by the memory footprint of the Key-Value (KV) cache when large language models (LLMs) handle long contexts. As the context length increases, the KV cache grows with it. This raises memory usage and also forces the decoder to repeatedly read large amounts of data from high-bandwidth memory (HBM) into the streaming multiprocessors (SMs), which reduces efficiency and makes practical deployment harder.

Existing methods usually either fine-tune the model to learn a compression strategy or use attention scores to reduce the sequence length. These methods often require complex algorithms or substantial computational overhead, and those that depend on attention scores are incompatible with FlashAttention as used in modern LLM inference systems, which limits their practicality.

By analysing the attention distributions in decoder-only Transformer-based models, the authors arrive at a simple KV cache compression strategy. They observe that attention allocation patterns stay consistent across most layers and that the L2 norm of a key embedding is clearly correlated with the attention score it receives: key embeddings with a low L2 norm usually receive high attention scores during decoding. Based on this finding, the authors compress the KV cache according to the L2 norm of the key embeddings. According to the abstract, this reduces the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy. Because the method does not rely on attention scores, it remains compatible with FlashAttention, which broadens its applicability.

In conclusion, the paper proposes an L2-norm-based KV cache compression strategy to address the memory and performance challenges that LLMs face when handling long contexts.
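
The core idea described above, keeping the cached KV pairs whose keys have the lowest L2 norm, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `compress_kv_by_key_norm`, the `keep_ratio` parameter, and the choice to operate on a single attention head are all assumptions made for the example. The summary does not say how the method treats individual layers, heads, or recent tokens.

```python
import numpy as np


def compress_kv_by_key_norm(keys, values, keep_ratio=0.5):
    """Keep the KV pairs whose key embeddings have the lowest L2 norm.

    Hypothetical helper for illustration only.
    keys, values: arrays of shape (seq_len, head_dim) for one attention head.
    keep_ratio: fraction of cached positions to retain (assumed parameter).
    Returns the retained keys, values, and their original positions.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))

    # L2 norm of every cached key; no attention scores are needed.
    norms = np.linalg.norm(keys, axis=-1)

    # Indices of the n_keep smallest norms, restored to sequence order.
    keep = np.sort(np.argsort(norms, kind="stable")[:n_keep])
    return keys[keep], values[keep], keep


# Toy cache with four positions; key norms are 3.0, 1.0, 2.0, 0.5.
keys = np.array([[3.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 0.5]])
values = np.arange(8, dtype=float).reshape(4, 2)
kept_keys, kept_values, kept_idx = compress_kv_by_key_norm(keys, values, 0.5)
print(kept_idx)  # positions of the two lowest-norm keys
```

Because the selection depends only on the keys, it can run before attention is computed. This is the property the summary credits for compatibility with FlashAttention, which does not expose the full attention matrix.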

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Lossless KV Cache Compression to 2%

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Unifying KV Cache Compression for Large Language Models with LeanKV

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Effectively Compress KV Heads for LLM

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

SnapKV: LLM Knows What You are Looking for Before Generation

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

Residual vector quantization for KV cache compression in large language model

HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing

A Method for Building Large Language Models with Predefined KV Cache Capacity

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache