[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
2024-12-03
Abstract: Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. Because FasterVLM eliminates redundant visual tokens immediately after the visual encoder, they never interact with the LLM, which results in faster VLM inference. It is worth noting that, benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at https://github.com/Theia-4869/FasterVLM.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the low inference efficiency of vision-language models (VLMs), which stems from the large number of visual tokens passed to the large language model (LLM). Existing methods accelerate VLM inference by pruning visual tokens, but they usually evaluate token importance from the text-visual cross-attention inside the LLM, which leads to significant performance degradation, especially at high pruning ratios. Specifically, the paper points out that text-visual cross-attention in LLMs does not accurately reflect the actual importance of visual tokens: the attention weight assigned to a visual token does not match its importance for the task. This mismatch is mainly caused by two phenomena, attention shift and attention dispersion. Attention shift means that text attention tends to concentrate on the later part of the visual token sequence. Attention dispersion means that, in LLMs, attention is spread across many visual tokens that each receive relatively high scores, while the highest attention value remains low. To solve these problems, the paper proposes FasterVLM, which uses the attention between the [CLS] token and the image tokens in the visual encoder to evaluate the importance of visual tokens more accurately. FasterVLM removes redundant visual tokens immediately after the visual encoder, so they never interact with the LLM, which speeds up VLM inference. Experimental results show that FasterVLM maintains strong performance across different VLM architectures and pruning ratios, and that at high pruning ratios it performs significantly better than existing methods based on text-visual attention.
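The pruning step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `prune_visual_tokens`, the assumption that the [CLS] attention has already been averaged over heads into one score per image token, and the choice to keep the surviving tokens in their original order are all hypothetical. The token count of 576 in the usage lines is only an illustrative value.

```python
import numpy as np


def prune_visual_tokens(image_tokens, cls_attention, reduction_ratio):
    """Keep the image tokens that receive the most [CLS] attention.

    image_tokens:    (N, D) array of image-token embeddings from the visual encoder.
    cls_attention:   (N,) array with one [CLS]-to-token attention score per token
                     (assumed here to be already averaged over attention heads).
    reduction_ratio: fraction of tokens to drop, in [0, 1).
    Returns the kept tokens and their indices in the original sequence.
    """
    n_tokens = image_tokens.shape[0]
    # Number of tokens that survive pruning (always keep at least one).
    n_keep = max(1, int(round(n_tokens * (1.0 - reduction_ratio))))
    # Indices of the n_keep highest attention scores.
    top = np.argsort(cls_attention)[-n_keep:]
    # Assumption: restore the original sequence order of the kept tokens.
    keep = np.sort(top)
    return image_tokens[keep], keep


# Usage with random stand-in data: 576 tokens pruned at a 95% reduction ratio.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 8))
scores = rng.random(576)
kept_tokens, kept_idx = prune_visual_tokens(tokens, scores, 0.95)
print(kept_tokens.shape)  # (29, 8)
```

Because the selection uses only encoder-side scores, it can run before any token reaches the LLM, which is the property the summary credits for the speedup.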