Abstract: KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We observe the issue with PCA projection where significant performance degradation is observed at low compression rates. To bridge the gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates for various layers and heads given varying compression budgets. Compared to previous works, our method can easily embrace pre-trained LLMs and hold a smooth tradeoff between performance and compression rate. We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.
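
The orthogonal-projection baseline mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the synthetic data, and the use of an uncentered SVD as the PCA step are all assumptions made for illustration. The idea shown is that keys and queries are both multiplied by the first `rank` columns of one orthogonal matrix, so only `rank` features per token need to be cached, and the attention logits are exact when `rank` equals the full feature dimension.

```python
import numpy as np


def fit_orthogonal_projection(features: np.ndarray) -> np.ndarray:
    """Return an orthogonal (d, d) matrix whose columns are principal
    directions of `features` (n, d), ordered by decreasing singular value.
    Assumption: uncentered PCA via SVD, with n >= d."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt.T


def compress(x: np.ndarray, projection: np.ndarray, rank: int) -> np.ndarray:
    """Project (n, d) features onto the first `rank` columns of `projection`."""
    return x @ projection[:, :rank]


# Synthetic stand-in for one attention head (hypothetical sizes).
rng = np.random.default_rng(0)
head_dim, seq_len = 64, 256
# Keys that are approximately rank 16, plus a little noise.
keys = rng.normal(size=(seq_len, 16)) @ rng.normal(size=(16, head_dim))
keys += 0.01 * rng.normal(size=keys.shape)
queries = rng.normal(size=(4, head_dim))

projection = fit_orthogonal_projection(keys)
exact_logits = queries @ keys.T

# Truncating to the first `rank` columns gives nested sub-projections:
# a smaller rank reuses a prefix of the columns of a larger one.
max_errors = {}
for rank in (8, 16, 32, 64):
    approx_logits = compress(queries, projection, rank) @ compress(keys, projection, rank).T
    max_errors[rank] = float(np.abs(approx_logits - exact_logits).max())
    print(f"rank={rank:3d}  max logit error={max_errors[rank]:.4f}")
```

Because the columns are orthonormal, the key reconstruction error can only shrink as more columns are kept. The abstract's point is that on real LLM features this PCA baseline alone still degrades too much, which is what motivates tuning the projection matrices.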

What problem does this paper attempt to address?

### What problems does the paper attempt to solve?

The paper addresses the KV cache (key-value cache) in large language models (LLMs) becoming a system bottleneck in both storage and memory transfer. Specifically:

1. **Background problems**:
   - As model and data scale increase, the KV cache grows rapidly, creating bottlenecks in storage and memory transfer.
   - Existing research mainly compresses the first three dimensions of the KV cache tensors (number of layers, number of heads, sequence length) and leaves the feature dimension uncompressed.
2. **Research objectives**:
   - Compress along the feature-dimension axis of the KV cache tensors by using low-rank projection matrices to transform the cached features into a lower-dimensional space.
   - Avoid the sharp performance drop that existing methods show at high compression rates, so that model performance is maintained under compression.
3. **Specific problems**:
   - How can the feature dimension of the KV cache be compressed without significantly reducing performance?
   - How can the different compression requirements of different layers and heads be accommodated to reach the best overall compression?

### Method overview

To solve the above problems, the authors propose the following:

1. **Initial attempt**:
   - Principal component analysis (PCA) is used for orthogonal projection. This method performs well at low compression rates, but performance drops sharply at high compression rates.
2. **Improvement plan**:
   - A knowledge-distillation-based objective is used to directly tune the orthogonal projection matrices so that the compressed output stays as close as possible to the original output.
   - A Matryoshka training strategy is introduced: different numbers of projection columns are randomly sampled to build the model, and its output is trained to stay close to the original output, which yields hierarchical (nested) compression.
3. **Adaptive search**:
   - At inference time, a greedy algorithm adaptively searches for the best compression rates for different layers and heads under a given compression budget.

### Experimental results

Experiments show that MatryoshkaKV maintains more than 90% of the original model's performance at an average compression rate of 60%, and up to 75% in certain extreme scenarios. In the supervised fine-tuning (SFT) setting in particular, the method reaches an average accuracy of 92.47% of the baseline model while using only 50% of the KV cache.

In conclusion, by tuning orthogonal projection matrices and searching compression rates adaptively, the paper addresses the storage and transfer bottlenecks of the KV cache in large language models while maintaining high model performance.
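
The adaptive search step can be sketched as a simple greedy loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: `greedy_rank_search`, `toy_error`, the per-head sensitivity values, and the fixed step size are all hypothetical, and a real search would measure the model's output error on calibration data rather than use a closed-form proxy. Starting from full rank everywhere, the loop repeatedly shrinks the head whose reduction increases the error the least, until the average retained rank meets the budget.

```python
from typing import Callable, List


def greedy_rank_search(
    num_heads: int,
    full_rank: int,
    step: int,
    budget: float,
    error_fn: Callable[[List[int]], float],
) -> List[int]:
    """Greedily lower per-head ranks until sum(ranks) <= budget * full_rank * num_heads.

    `error_fn` (hypothetical) maps a list of per-head ranks to a scalar error.
    Each head keeps at least `step` dimensions.
    """
    ranks = [full_rank] * num_heads
    target_total = budget * full_rank * num_heads
    while sum(ranks) > target_total:
        best_head, best_error = None, float("inf")
        for head in range(num_heads):
            if ranks[head] - step < step:
                continue  # this head is already at its minimum rank
            trial = ranks.copy()
            trial[head] -= step
            error = error_fn(trial)
            if error < best_error:
                best_head, best_error = head, error
        if best_head is None:
            break  # no head can be reduced further
        ranks[best_head] -= step
    return ranks


# Toy proxy: each head has a made-up sensitivity, and the error grows
# linearly with the number of dimensions removed from that head.
sensitivities = [4.0, 1.0, 2.0, 0.5]
full_rank = 128


def toy_error(ranks: List[int]) -> float:
    return sum(s * (full_rank - r) for s, r in zip(sensitivities, ranks))


ranks = greedy_rank_search(
    num_heads=len(sensitivities),
    full_rank=full_rank,
    step=32,
    budget=0.5,
    error_fn=toy_error,
)
print(ranks)
```

In this toy setting the least sensitive heads are compressed first and the most sensitive head keeps its full rank, which mirrors the idea of assigning different compression rates to different layers and heads under one overall budget.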

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Effectively Compress KV Heads for LLM

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Unifying KV Cache Compression for Large Language Models with LeanKV

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Palu: Compressing KV-Cache with Low-Rank Projection

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

Lossless KV Cache Compression to 2%

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

Residual vector quantization for KV cache compression in large language model

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression