Abstract: Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in prefill and decoding phases. Based on these observations, we introduce a layer-adaptive sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV cache size without compromising accuracy. Additionally, we develop a modality-aware token scoring policy to better evaluate the token importance. Empirical results on multiple benchmark datasets demonstrate that retaining only 10% of KV cache achieves accuracy comparable to that with full cache. In a speed benchmark, our method accelerates end-to-end latency of generating 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the memory footprint of KV cache in GPU by 90%.

What problem does this paper attempt to address?

The paper addresses how to compress the Key-Value (KV) cache during Vision-Language Model (VLM) inference so that the model runs faster and uses less GPU memory. Existing KV cache compression methods were designed mainly for Large Language Models (LLMs); when applied directly to VLMs they lose accuracy and deliver limited speedup. The paper therefore proposes VL-Cache, a KV cache compression scheme designed specifically for VLMs, which aims to improve efficiency while preserving accuracy on long visual contexts (such as images or videos) by optimizing how the KV cache is allocated and managed.

### Main Contributions

1. **Analysis of Attention Sparsity in VLMs**: The paper shows that VLMs exhibit distinctive attention sparsity patterns in the prefill and decoding phases, and that these patterns differ significantly from those of LLMs. The attention matrix of a VLM has a clear modality boundary: visual tokens and the language tokens that follow them are attended to in visibly different ways.
2. **Layer-Adaptive Sparsity-Aware Cache Budget Allocation**: The paper proposes a dynamic allocation method that distributes the cache budget according to the attention sparsity of each layer, measured at inference time. High task-level accuracy can be maintained even with only 10% of the KV cache budget.
3. **Modality-Aware Token Scoring Strategy**: The paper observes that language-to-visual attention reflects the importance of visual tokens, so visual and language attention scores are treated differently in order to better retain important tokens.

### Experimental Results

- **Accuracy Evaluation**: On multiple benchmark datasets, VL-Cache achieves accuracy close to that of the full cache while retaining only 10% of the KV cache, significantly better than other compression methods.
- **Speed Benchmark Test**: VL-Cache reduces the end-to-end latency of generating 100 tokens by up to 2.33x and speeds up decoding by up to 7.08x, while reducing the GPU memory footprint of the KV cache by 90%.

### Method Overview

1. **Sparsity-Aware Cache Budget Allocation**:
   - Use Post-vision Attention to calculate the sparsity of each layer.
   - Allocate the total KV cache budget according to the sparsity ratio of each layer, so that each layer receives an appropriate share of the cache.
2. **Modality-Aware Token Scoring Strategy**:
   - Calculate the importance of tokens based on Post-vision Attention.
   - Retain the most important tokens so as to maximize the cache hit rate, thereby maintaining high accuracy under a limited cache budget.

### Formula Examples

- **Threshold Filter**:
  \[ \text{ThresholdFilter}(A, p)_{ij} = \begin{cases} A_{ij} & \text{if } A_{ij} \geq p \cdot \max_j(A_{ij}) \\ 0 & \text{otherwise} \end{cases} \]
  where \( p \in (0, 1) \) controls the intensity of sparsification.
- **Layer Sparsity Calculation**:
  \[ \gamma^{(l)} = \frac{\sum_{i \geq j} \mathbb{1}[\text{ThresholdFilter}(A^{(l)}, p)_{ij} = 0]}{| \{ A^{(l)}_{ij} : i \geq j \} |} \]
  where \( A^{(l)} \) is the attention matrix of layer \( l \) and the sum runs over the causal entries \( i \geq j \).
- **Cache Hit Rate**:
  \[ \text{CacheHitRate} = \frac{|S_\psi \cap S_{\psi^*}|}{|S_{\psi^*}|} \]
  where \( S_{\psi^*} \) is the set of top \( k \) tokens under the optimal strategy, and \( S_\psi \) is the set of top \( k \) tokens under the current strategy.

Through these methods...
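The three formulas above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the attention matrix is assumed to be a small square causal matrix given as nested lists, and the budget rule (each layer's share proportional to its density \( 1 - \gamma^{(l)} \), clipped at 100%) is an assumption about what "allocate according to the sparsity ratio" means.

```python
from typing import List, Sequence, Set

Matrix = List[List[float]]


def threshold_filter(attn: Matrix, p: float) -> Matrix:
    """Zero every entry that is below p times the maximum of its row."""
    filtered = []
    for row in attn:
        cutoff = p * max(row)
        filtered.append([a if a >= cutoff else 0.0 for a in row])
    return filtered


def layer_sparsity(attn: Matrix, p: float) -> float:
    """Fraction of causal entries (i >= j) that are zero after filtering."""
    filtered = threshold_filter(attn, p)
    zeros = 0
    total = 0
    for i, row in enumerate(filtered):
        for j in range(i + 1):  # causal part of row i: columns 0..i
            total += 1
            if row[j] == 0.0:
                zeros += 1
    return zeros / total


def allocate_budgets(sparsities: Sequence[float], avg_keep_ratio: float) -> List[float]:
    """Split the overall budget across layers in proportion to density.

    ASSUMPTION: share is proportional to (1 - gamma), clipped at 1.0.
    Clipping can leave part of the total budget unused.
    """
    densities = [1.0 - g for g in sparsities]
    total_density = sum(densities)
    if total_density == 0.0:  # every layer fully sparse: fall back to uniform
        return [avg_keep_ratio] * len(densities)
    scale = avg_keep_ratio * len(densities) / total_density
    return [min(1.0, d * scale) for d in densities]


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])


def cache_hit_rate(scores: Sequence[float], oracle_scores: Sequence[float], k: int) -> float:
    """Overlap between the kept top-k tokens and the oracle top-k tokens."""
    kept = top_k(scores, k)
    ideal = top_k(oracle_scores, k)
    return len(kept & ideal) / len(ideal)


if __name__ == "__main__":
    # Toy 3x3 causal attention matrix (rows sum to 1, upper triangle is 0).
    attn = [
        [1.00, 0.00, 0.00],
        [0.90, 0.10, 0.00],
        [0.05, 0.05, 0.90],
    ]
    gamma = layer_sparsity(attn, p=0.2)
    print("layer sparsity:", gamma)
    print("budgets:", allocate_budgets([gamma, 0.9], avg_keep_ratio=0.1))
    print("hit rate:", cache_hit_rate([0.1, 0.5, 0.4, 0.0], [0.5, 0.4, 0.1, 0.0], k=2))
```

In this toy setup the sparser layer receives the smaller share of the budget, which matches the intuition that a layer whose attention concentrates on few tokens can drop more of its cache. A real implementation would operate on batched, multi-head attention tensors and on the Post-vision Attention rows only; both are omitted here for brevity.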

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
