Abstract: In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. Differently, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces 61% to 67% FLOPs with a compression ratio of 78% while maintaining 93% of the accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses the significant computational overhead caused by visual tokens in Vision-Language Models (VLMs). Although visual tokens generally carry less information per token than text tokens, processing them still consumes substantial computational resources, especially for high-resolution images and multi-frame videos. Most existing methods train an additional network to prune redundant visual tokens, which requires extra training data. These methods also tend to overlook the role text tokens can play in judging which visual tokens matter.

To solve this issue, the authors propose an efficient, training-free token optimization mechanism called SparseVLM. It works through the following steps:

1. **Selection of Visually Relevant Text Tokens**: Using the text-to-vision entries of the self-attention matrix already computed by the VLM, identify the text tokens that are strongly correlated with visual signals. These tokens serve as "raters" for evaluating the importance of visual tokens.
2. **Evaluation of Visual Token Importance**: Score each visual token by the attention it receives from the selected text raters, and progressively prune the unimportant tokens.
3. **Adaptive Sparsification Ratio**: Use a rank-based strategy to determine the sparsification ratio adaptively at each layer, maximizing sparsity while retaining critical information.
4. **Token Recycling**: Compress the pruned tokens into a more compact representation to reduce information loss.

Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. For instance, LLaVA equipped with SparseVLM reduces FLOPs by 61% to 67% at a compression ratio of 78% while retaining 93% of the original accuracy.
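The four steps above can be sketched for a single layer with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the "above-average relevance" threshold for choosing text raters, the `scale` factor, the exact form of the rank rule (prune roughly the number of visual tokens minus the rank of the rater-to-vision attention block), and the single score-weighted recycled token are all hypothetical simplifications chosen for readability.

```python
import numpy as np

def select_text_raters(attn, vis_idx, txt_idx):
    """Step 1 (assumed rule): keep text tokens whose mean attention to
    visual tokens is at least the average over all text tokens."""
    relevance = attn[np.ix_(txt_idx, vis_idx)].mean(axis=1)
    raters = txt_idx[relevance >= relevance.mean()]
    # Fall back to all text tokens if rounding leaves the selection empty.
    return raters if len(raters) > 0 else txt_idx

def score_visual_tokens(attn, vis_idx, raters):
    """Step 2: score each visual token by the mean attention from the raters."""
    return attn[np.ix_(raters, vis_idx)].mean(axis=0)

def adaptive_num_pruned(attn, vis_idx, raters, scale=1.0):
    """Step 3 (assumed rule): a low-rank rater-to-vision block suggests
    redundancy, so prune about scale * (n_vis - rank) tokens."""
    rank = np.linalg.matrix_rank(attn[np.ix_(raters, vis_idx)])
    n_prune = int(scale * (len(vis_idx) - rank))
    # Always keep at least one visual token.
    return max(0, min(n_prune, len(vis_idx) - 1))

def sparsify_layer(hidden, attn, vis_idx, txt_idx, scale=1.0):
    """Return indices of kept visual tokens and a recycled token (step 4)."""
    raters = select_text_raters(attn, vis_idx, txt_idx)
    scores = score_visual_tokens(attn, vis_idx, raters)
    n_prune = adaptive_num_pruned(attn, vis_idx, raters, scale)

    order = np.argsort(scores)              # ascending: least important first
    pruned = vis_idx[order[:n_prune]]
    kept = np.sort(vis_idx[order[n_prune:]])

    if n_prune > 0:
        # Assumed simplification: merge all pruned tokens into one token,
        # weighted by their scores.
        weights = scores[order[:n_prune]]
        weights = weights / weights.sum()
        recycled = (weights[:, None] * hidden[pruned]).sum(axis=0, keepdims=True)
    else:
        recycled = np.empty((0, hidden.shape[1]))
    return kept, recycled

# Toy usage: 8 visual tokens followed by 4 text tokens, random attention.
rng = np.random.default_rng(0)
n_vis, n_txt, dim = 8, 4, 16
vis_idx = np.arange(n_vis)
txt_idx = np.arange(n_vis, n_vis + n_txt)
logits = rng.normal(size=(n_vis + n_txt, n_vis + n_txt))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row softmax
hidden = rng.normal(size=(n_vis + n_txt, dim))

kept, recycled = sparsify_layer(hidden, attn, vis_idx, txt_idx)
print("kept visual tokens:", kept, "| recycled tokens:", recycled.shape[0])
```

In a real VLM the attention matrix would come from a decoder layer (typically averaged over heads) and the procedure would be repeated at several layers, so the number of surviving visual tokens shrinks progressively rather than in one shot.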

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Inference Optimal VLMs Need Only One Visual Token but Larger Models

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

FoPru: Focal Pruning for Efficient Large Vision-Language Models

VoCo-LLaMA: Towards Vision Compression with Large Language Models

PruneVid: Visual Token Pruning for Efficient Video Large Language Models

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM