Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.

What problem does this paper attempt to address?

This paper addresses visual token redundancy in Vision-Language Models (VLMs). Specifically:

1. **High computational cost due to long visual tokens**: Increasing the number of visual tokens has become a common way to improve VLM performance. For example, LLaVA-1.5 uses 576 visual tokens, while in LLaVA-NeXT a 672x672 image generates more than 2,880 visual tokens. The visual tokens therefore far outnumber the text tokens, which results in significant computational and memory consumption.
2. **Visual token redundancy**: By analyzing the visual tokens generated by widely used vision encoders such as CLIP and SigLIP, the authors found that most visual tokens receive very little attention and only a few tokens carry most of the information. The existing visual tokens are therefore highly redundant. This redundancy not only increases the computational burden but may also act as noise that degrades the model's performance.

### Solution

To address these problems, the authors propose the **VisionZip** method, whose main goals are:

- **Reduce visual token redundancy**: Select the most informative visual tokens and discard unnecessary ones, reducing computational and memory consumption.
- **Maintain or improve model performance**: Reduce the number of visual tokens without a significant drop in performance, and in some cases with a gain.

### Method overview

VisionZip consists of the following steps:

1. **Dominant Token Selection**: Using the vision encoder's attention, select the visual tokens that receive the most attention. For models with a CLS token (such as CLIP), use the attention scores of the CLS token to identify key visual tokens; for models without a CLS token (such as SigLIP), calculate the average attention each token receives from the other tokens in the sequence.
2. **Contextual Tokens Merging**: To retain small but potentially important information, merge the remaining tokens based on similarity to create more informative contextual tokens.
3. **Efficient Tuning**: Fine-tune the multimodal projection layer with minimal instruction-tuning data to improve the alignment between the visual and language spaces, so that the model adapts better to the reduced number of visual tokens.

### Experimental results

The experiments show that VisionZip significantly outperforms existing methods on multiple benchmarks. In particular, it maintains or even improves model performance when the number of visual tokens is reduced. VisionZip also significantly increases inference speed by reducing the prefilling time, and it applies to multiple tasks, such as image and video understanding and multi-turn dialogues.

### Conclusion

By reducing visual token redundancy, VisionZip improves computational efficiency while maintaining, and in some settings improving, model performance, which shows that not all visual tokens are necessary in Vision-Language Models. This finding encourages the research community to focus on extracting better visual features rather than simply increasing the token length.
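
The first two steps of the method overview above (dominant token selection and contextual token merging) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the choice of evenly spaced merge targets, cosine similarity on the token features, and dropping the CLS token from the output are all assumptions made for illustration. The attention tensor is assumed to be row-softmaxed with shape `(heads, N, N)`, and the CLS token, when present, is assumed to sit at index 0.

```python
import numpy as np


def attention_scores(attn, has_cls):
    """Score each visual token from a (heads, N, N) attention tensor attn[h, query, key]."""
    mean_attn = attn.mean(axis=0)  # average over heads -> (N, N)
    if has_cls:
        # Attention the CLS query (assumed at index 0) pays to each patch token.
        return mean_attn[0, 1:]
    # No CLS token: average attention each token receives over all queries.
    return mean_attn.mean(axis=0)


def select_dominant(scores, k):
    """Return (kept, remaining) token indices, both sorted in original order."""
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:k]), np.sort(order[k:])


def merge_contextual(rest_tokens, m):
    """Merge the remaining tokens into at most m contextual tokens by similarity."""
    n, d = rest_tokens.shape
    m = min(m, n)
    if m == 0:
        return np.empty((0, d), dtype=rest_tokens.dtype)
    # Assumption: merge targets are evenly spaced among the remaining tokens.
    target_idx = np.arange(m) * n // m
    norms = np.linalg.norm(rest_tokens, axis=1, keepdims=True)
    normed = rest_tokens / np.maximum(norms, 1e-12)
    sim = normed @ normed[target_idx].T  # cosine similarity, shape (n, m)
    assign = sim.argmax(axis=1)          # most similar target for each token
    assign[target_idx] = np.arange(m)    # each target stays in its own group
    # Each contextual token is the mean of the tokens assigned to that target.
    return np.stack([rest_tokens[assign == j].mean(axis=0) for j in range(m)])


def compress_tokens(tokens, attn, k, m, has_cls=True):
    """Keep k dominant tokens and append up to m merged contextual tokens."""
    scores = attention_scores(attn, has_cls)
    patches = tokens[1:] if has_cls else tokens
    keep, rest = select_dominant(scores, k)
    return np.concatenate([patches[keep], merge_contextual(patches[rest], m)], axis=0)
```

For an encoder output `tokens` of shape `(N, d)`, `compress_tokens(tokens, attn, k, m)` returns an array of shape `(k + m, d)` whenever at least `m` tokens remain after selection. The third step, efficient tuning of the projection layer, is a training procedure and is not covered by this sketch.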

VisionZip: Longer is Better but Not Necessary in Vision Language Models

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Make A Long Image Short: Adaptive Token Length for Vision Transformers

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Efficient Large Multi-modal Models via Visual Context Compression

Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

EVLM: An Efficient Vision-Language Model for Visual Understanding

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

High Efficiency Image Compression for Large Visual-Language Models

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Towards Better Vision-Inspired Vision-Language Models

Inference Optimal VLMs Need Only One Visual Token but Larger Models

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster