Abstract: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
### What problem does this paper attempt to solve?
This paper aims to solve the problem of visual token redundancy in Vision-Language Models (VLMs). Specifically:
1. **High computational cost due to long visual tokens**: Increasing the number of visual tokens has become a common way to improve VLM performance. For example, LLaVA-1.5 uses 576 visual tokens, while in LLaVA-NeXT a 672x672 image generates more than 2,880 visual tokens. As a result, visual tokens far outnumber text tokens, which leads to significant computational and memory consumption.
2. **Visual token redundancy**: By analyzing the visual tokens generated by widely used visual encoders such as CLIP and SigLIP, the authors found that most visual tokens receive very little attention, and only a few tokens carry most of the information. In other words, the existing visual tokens are highly redundant. This redundancy not only increases the computational burden but may also act as noise that affects the model's judgment and degrades performance.
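The token counts quoted above can be reproduced with simple arithmetic. The sketch below is illustrative only and rests on assumptions that the summary does not state: a ViT encoder with 14-pixel patches at 336x336 input, and a tiling scheme that splits a 672x672 image into a 2x2 grid of 336x336 tiles plus one downscaled overview image. The helper name `patch_tokens` is hypothetical, and any additional separator tokens a real model may insert are not modeled.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 14-pixel patches on a 336x336 input (24 x 24 patches).
base_tokens = patch_tokens(336, 14)

# Assumption: a 672x672 image becomes a 2x2 grid of 336x336 tiles,
# plus one downscaled overview of the whole image.
num_views = (672 // 336) ** 2 + 1
tiled_tokens = num_views * base_tokens

print(base_tokens, tiled_tokens)
```

Under these assumptions, a single view yields 576 tokens and the five views together yield 2,880, which is why the visual sequence quickly dwarfs a typical text prompt.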
### Solution
To address the above problems, the authors proposed the **VisionZip** method, whose main goals are:
- **Reduce visual token redundancy**: By selecting the most informative visual tokens, reduce the number of unnecessary tokens, thereby reducing computational and memory consumption.
- **Maintain or improve model performance**: While reducing the number of visual tokens, ensure that the model's performance does not significantly decline or even improves.
### Method overview
The specific implementation of VisionZip includes the following steps:
1. **Dominant Token Selection**: Based on the attention mechanism, select those visual tokens that receive the most attention. For models with CLS tokens (such as CLIP), use the attention scores of CLS tokens to identify key visual tokens; for models without CLS tokens (such as SigLIP), calculate the average attention received by each token from other tokens in the sequence.
2. **Contextual Tokens Merging**: In order to retain small but potentially important information, merge the remaining tokens based on similarity to create more informative contextual tokens.
3. **Efficient Tuning**: Efficiently fine-tune the multimodal projection layer with minimal instruction-tuning data to enhance the alignment between the visual and language spaces, enabling the model to better adapt to the reduced number of visual tokens.
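The first two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the choice of evenly spaced merge targets, cosine similarity on the raw token vectors, and plain averaging are all assumptions made here for clarity, since the summary only says that the remaining tokens are merged "based on similarity". The efficient-tuning step is not shown.

```python
import math


def dominant_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mean_received_attention(attn):
    """For encoders without a CLS token: attn[i][j] is the attention from
    token i to token j; return the average attention each token j receives
    from the other tokens. Assumes at least two tokens."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n) if i != j) / (n - 1)
            for j in range(n)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_contextual(tokens, m):
    """Merge tokens into at most m contextual tokens (assumed scheme):
    pick evenly spaced targets, assign every other token to its most
    similar target, then average each group."""
    if m <= 0 or not tokens:
        return []
    m = min(m, len(tokens))
    step = len(tokens) / m
    target_ids = [int(i * step) for i in range(m)]
    groups = {t: [tokens[t]] for t in target_ids}
    for i, tok in enumerate(tokens):
        if i in groups:
            continue
        best = max(target_ids, key=lambda t: cosine(tok, tokens[t]))
        groups[best].append(tok)
    return [[sum(col) / len(groups[t]) for col in zip(*groups[t])]
            for t in target_ids]


def compress(tokens, scores, k, m):
    """Keep k dominant tokens and append m merged contextual tokens."""
    keep = dominant_indices(scores, k)
    keep_set = set(keep)
    rest = [tok for i, tok in enumerate(tokens) if i not in keep_set]
    return [tokens[i] for i in keep] + merge_contextual(rest, m)


# Toy example: six 2-d "visual tokens" with made-up CLS attention scores.
tokens = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 1], [0.5, 0.5]]
cls_scores = [0.05, 0.40, 0.05, 0.30, 0.10, 0.10]
compressed = compress(tokens, cls_scores, k=2, m=2)
print(len(tokens), "->", len(compressed))
```

In this toy run, the two tokens with the highest scores are kept unchanged and the remaining four are folded into two averaged contextual tokens, so six tokens shrink to four. For an encoder without a CLS token, `mean_received_attention` would supply the scores instead of `cls_scores`.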
### Experimental results
The experimental results show that VisionZip significantly outperforms existing methods on multiple benchmarks. Even with far fewer visual tokens, it maintains or even improves model performance. In addition, VisionZip significantly increases inference speed by reducing the prefilling time, and it applies to multiple tasks, such as image and video understanding and multi-turn conversations.
### Conclusion
By reducing visual token redundancy, VisionZip not only improves computational efficiency but can also enhance the model's performance, showing that not all visual tokens are necessary in Vision-Language Models. This finding encourages the research community to pay more attention to extracting better visual features rather than simply increasing the token length.