A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini
2024-10-29
Abstract: The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the decoding latency and memory pressure caused by the Key-Value (KV) cache when large language models (LLMs) process long contexts. The KV cache grows with the context length. This increases memory usage, and it also forces the model to read a large amount of data from high-bandwidth memory (HBM) into the streaming multiprocessors (SMs) at every decoding step, which slows inference and makes practical deployment harder.
Existing methods either fine-tune the model to learn a compression strategy or use attention scores to shorten the cached sequence. These methods often require complex algorithms or substantial computational overhead, and those that rely on attention scores are incompatible with FlashAttention, which modern LLM inference systems use widely. This limits their practicality.
By analysing the attention distributions in decoder-only Transformer-based models, the authors arrive at a simple compression strategy. They observe that attention allocation patterns stay consistent across most layers, and that the $L_2$ norm of a key embedding is clearly correlated with the attention score it receives. The relationship is inverse: key embeddings with a low $L_2$ norm usually receive high attention scores during decoding. This suggests that the influence of a KV pair can be estimated from the key embedding alone, before any query is seen.
Based on this finding, the authors compress the KV cache according to the $L_2$ norm of the key embeddings, retaining the KV pairs whose keys have low norms. According to the reported experiments, this reduces the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy. Because it does not need attention scores, the method remains compatible with FlashAttention, which broadens its applicability.
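
The selection rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: it handles a single attention head, uses plain Python lists instead of tensors, and the names `l2_norm`, `compress_kv_cache`, and `keep_ratio` are hypothetical. It assumes that compression means keeping a fixed fraction of the cached KV pairs, chosen as those whose keys have the lowest $L_2$ norm.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the KV pairs whose key embeddings have the lowest L2 norm.

    keys, values: one vector per cached token (single head).
    keep_ratio:   hypothetical knob, fraction of cached pairs to retain.
    Returns the retained keys, retained values, and their original positions.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_keep = max(1, int(len(keys) * keep_ratio))
    norms = [l2_norm(k) for k in keys]

    # Rank cached positions by key norm, lowest first, and take the top n_keep.
    lowest = sorted(range(len(keys)), key=norms.__getitem__)[:n_keep]
    # Restore the original token order so positions stay monotonic.
    kept = sorted(lowest)

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; the key norms are 5.0, ~0.22, 1.0 and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
kept_keys, kept_values, kept_idx = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept_idx)  # [1, 2]
```

No attention scores appear anywhere in the sketch: the decision depends only on the cached keys, which is the property that keeps this kind of rule compatible with FlashAttention. A practical version would operate on batched tensors per layer and head.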
In conclusion, the paper tackles the memory and performance challenges that large language models face on long contexts by proposing a KV cache compression strategy based on the $L_2$ norm of key embeddings.