Abstract: Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. FasterVLM eliminates redundant visual tokens immediately after the visual encoder, ensuring they do not interact with LLMs, which results in faster VLM inference. It is worth noting that, benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at https://github.com/Theia-4869/FasterVLM.

What problem does this paper attempt to address?

This paper addresses the low inference efficiency of vision-language models (VLMs), which comes from the large number of visual tokens that interact with the large language model (LLM). Existing acceleration methods mainly prune visual tokens, but they usually score token importance with the text-visual cross-attention inside the LLM, which leads to significant performance degradation, especially at high pruning ratios. Specifically, the paper argues that text-visual cross-attention in LLMs does not accurately reflect the actual importance of visual tokens: the attention weight assigned to a visual token does not match its importance for the task. The paper attributes this mismatch to two phenomena, attention shift and attention dispersion. Attention shift means that text attention tends to concentrate on the later part of the visual token sequence. Attention dispersion means that, in LLMs, many visual tokens receive relatively high attention scores while the highest attention value remains low.

To address these problems, the paper proposes FasterVLM, which uses the attention between the [CLS] token and the image tokens in the visual encoder to evaluate the importance of visual tokens more accurately. FasterVLM removes redundant visual tokens immediately after the visual encoder, so they never interact with the LLM, which speeds up VLM inference. Experimental results show that FasterVLM maintains strong performance across different VLM architectures and pruning ratios, and at high pruning ratios it significantly outperforms existing methods based on text-visual attention.
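The pruning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prune_by_cls_attention`, the plain-list input, the rounding rule for the number of kept tokens, and the choice to return kept indices in their original order are all assumptions made for illustration. It only shows the general idea of ranking image tokens by their [CLS] attention score and keeping the top fraction before anything reaches the LLM.

```python
def prune_by_cls_attention(cls_attention, reduction_ratio):
    """Return indices of the visual tokens to keep, in original order.

    cls_attention: one score per image token. Assumed here to be the
        [CLS]-to-image-token attention taken from the visual encoder
        (how heads and layers are aggregated is left unspecified).
    reduction_ratio: fraction of tokens to drop, in [0, 1).
    """
    if not 0.0 <= reduction_ratio < 1.0:
        raise ValueError("reduction_ratio must be in [0, 1)")

    n_tokens = len(cls_attention)
    # Assumption: round to the nearest integer and always keep at least one token.
    n_keep = max(1, round(n_tokens * (1.0 - reduction_ratio)))

    # Rank token indices from highest to lowest [CLS] attention score.
    ranked = sorted(range(n_tokens), key=lambda i: cls_attention[i], reverse=True)

    # Assumption: restore the original order so positions stay consistent.
    return sorted(ranked[:n_keep])


# Hypothetical scores for five image tokens; drop 60% of them.
scores = [0.10, 0.40, 0.05, 0.30, 0.15]
keep = prune_by_cls_attention(scores, reduction_ratio=0.6)
visual_tokens = ["t0", "t1", "t2", "t3", "t4"]
pruned_tokens = [visual_tokens[i] for i in keep]
```

Under these assumptions, a 95% reduction ratio applied to a sequence of 576 visual tokens would leave 29 tokens to pass to the LLM.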

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

FoPru: Focal Pruning for Efficient Large Vision-Language Models

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

A-VL: Adaptive Attention for Large Vision-Language Models

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation