MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu
2024-10-16
Abstract: Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods that integrate multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the difficulty that current multimodal large language models (MLLMs) have in capturing fine image details, despite their progress in understanding complex human intentions through cross-modal interactions. Specifically, the paper points out:

1. **Limitations of existing methods**:
   - Most existing MLLMs use only the last-layer feature map of the vision encoder as the visual representation, ignoring the rich fine-grained information in the shallow-layer feature maps.
   - Methods that integrate multiple vision encoders can enhance visual detail, but they introduce redundancy and computational overhead.
2. **The core of the problem**:
   - Deep-layer feature maps carry high-level semantic information, but they lack local detail and therefore perform poorly on fine-grained visual recognition tasks.
   - Shallow-layer feature maps capture more fine-grained detail, but they are poorly aligned with the text feature space, so the model has difficulty making effective use of that detail.
3. **Proposed method**:
   - To overcome these problems, the paper proposes MMFuser, a simple and effective multi-layer feature fusion module. It uses deep-layer features as queries to dynamically extract missing details from shallow-layer features, which enriches the visual representation with fine-grained information while maintaining semantic alignment.
4. **Objective**:
   - Improve the performance of MLLMs when processing images and videos, especially on fine-grained visual recognition tasks such as optical character recognition (OCR) and visual grounding.
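
The idea of using deep-layer features as queries over shallow-layer features can be illustrated with a small cross-attention sketch. This is not the released MMFuser implementation: the function names `cross_attention` and `fuse_layers` are hypothetical, the attention is plain scaled dot-product attention with no learned projections, and the residual addition at the end is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query token attends over all key/value tokens.

    Assumption: keys and values are the same vectors and there are no
    learned projection matrices, to keep the sketch minimal.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                         for d in range(dim)])
    return attended


def fuse_layers(deep, shallow_layers):
    """Hypothetical fusion step: deep tokens query the shallow tokens.

    deep:           list of token vectors from the last ViT layer
    shallow_layers: list of layers, each a list of token vectors
    """
    # Pool the tokens of all shallow layers into one key/value set.
    shallow_tokens = [tok for layer in shallow_layers for tok in layer]
    details = cross_attention(deep, shallow_tokens)
    # Assumed residual connection: keep the deep features and add the
    # detail retrieved from the shallow layers.
    return [[x + d for x, d in zip(deep_tok, detail_tok)]
            for deep_tok, detail_tok in zip(deep, details)]
```

Because the output keeps one token per deep-layer query, the fused representation has the same length as the last-layer feature map, so in this sketch it could stand in wherever the last-layer features were used before.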
### Summary

The main contribution of the paper lies in revealing that the potential of a single vision encoder in MLLMs has not been fully exploited, and in proposing MMFuser, a new multi-layer feature fusion method that significantly improves model performance on various multimodal benchmarks by dynamically combining shallow-layer and deep-layer features.