Abstract: The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache during the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies address only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity across layers or tasks. In this paper, we present ZipVL, an efficient inference framework for LVLMs that resolves both the computation and memory bottlenecks through a dynamic ratio allocation strategy for important tokens. This ratio is determined adaptively from the layer-specific distribution of attention scores rather than from fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. We then select important tokens based on their normalized attention scores and compute attention solely on those tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we apply mixed-precision quantization to the KV cache: high-bit quantization is used for the caches of important tokens, while low-bit quantization is applied to those of less important tokens. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6$\times$ and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
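
The two ideas in the abstract, an adaptively chosen ratio of important tokens and mixed-precision KV cache quantization, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not state the exact selection rule, so the sketch assumes one plausible reading: keep the smallest set of tokens whose normalized attention scores sum to at least a threshold `tau`. The function names, the value of `tau`, the per-token uniform quantizer, and the 4-bit/2-bit widths are all hypothetical choices made for illustration.

```python
import numpy as np


def select_important_tokens(attn, tau=0.95):
    """Pick important key tokens from one layer's attention map.

    attn: (num_queries, num_keys) array of softmax attention probabilities.
    tau:  hypothetical threshold on cumulative normalized attention mass.
    Returns (sorted indices of important tokens, fraction of tokens kept).
    """
    # Normalized per-token score: mean attention each key token receives.
    scores = attn.mean(axis=0)
    scores = scores / scores.sum()
    # Rank tokens from most to least attended.
    order = np.argsort(-scores, kind="stable")
    cum = np.cumsum(scores[order])
    # Smallest k whose cumulative mass reaches tau (assumed rule).
    k = min(int(np.searchsorted(cum, tau)) + 1, scores.size)
    return np.sort(order[:k]), k / scores.size


def fake_quantize(x, bits):
    """Uniform asymmetric per-token quantization, then dequantization."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Guard against a zero range for constant rows.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((x - lo) / scale)
    return q * scale + lo


def mixed_precision_kv(kv, important, high_bits=4, low_bits=2):
    """Quantize a (num_tokens, dim) cache: more bits for important tokens."""
    out = fake_quantize(kv, low_bits)
    out[important] = fake_quantize(kv[important], high_bits)
    return out
```

Under this assumed rule, a layer whose attention is concentrated on a few tokens keeps a small ratio, while a layer with diffuse attention keeps most tokens, so the ratio follows the attention distribution instead of being a fixed hyper-parameter. The quantizer here only simulates the rounding error; a real cache would store the integer codes together with `scale` and `lo` to save memory.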

ZACK: Zero-Overhead LLM Inference Acceleration via Dimensionality Compression of the Key-Value Cache

Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

Lossless KV Cache Compression to 2%

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

Unifying KV Cache Compression for Large Language Models with LeanKV

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Effectively Compress KV Heads for LLM

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression