Abstract: Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses a weakness of current multi-modal large language models (MLLMs): although they have become much better at understanding complex human intentions through cross-modal interaction, they still struggle to capture fine image details. Specifically, the paper points out:

1. **Limitations of existing methods**:
   - Most existing MLLMs use only the last-layer feature map of the visual encoder as the visual representation, ignoring the rich fine-grained information in the shallow-layer feature maps.
   - Methods that integrate multiple visual encoders can enhance visual details, but they introduce redundancy and computational overhead.
2. **The core of the problem**:
   - Deep-layer feature maps carry high-level semantic information, but they lack local details and therefore perform poorly on fine-grained visual recognition tasks.
   - Shallow-layer feature maps capture more fine-grained details, but they are poorly aligned with the text feature space, which makes it difficult for the model to use this detail effectively.
3. **Proposed method**:
   - To overcome these problems, the paper proposes MMFuser, a simple and effective multi-layer feature fusion module. It uses deep-layer features as queries to dynamically extract missing details from shallow-layer features, enriching the visual representation with fine-grained information while maintaining semantic alignment.
4. **Objective**:
   - Improve the performance of MLLMs when processing images and videos, especially in fine-grained visual recognition tasks such as optical character recognition (OCR) and visual grounding.

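
The query-based fusion described under "Proposed method" can be illustrated with a minimal sketch. It assumes plain scaled dot-product cross-attention with a residual connection; the released MMFuser code may use a different attention variant, and every name here (`fuse_layers`, `w_q`, `w_k`, `w_v`, the random features) is hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_layers(deep, shallow_layers, w_q, w_k, w_v):
    """Fuse deep and shallow ViT features with cross-attention (illustrative only).

    deep:           (num_tokens, dim) last-layer features, used as queries
    shallow_layers: list of (num_tokens, dim) earlier-layer features
    w_q, w_k, w_v:  (dim, dim) projection matrices (assumed, randomly initialised below)
    """
    # Stack tokens from all shallow layers into one key/value set.
    shallow = np.concatenate(shallow_layers, axis=0)   # (num_layers * num_tokens, dim)
    q = deep @ w_q
    k = shallow @ w_k
    v = shallow @ w_v
    # Each deep token attends over every shallow token.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    details = attn @ v                                  # details pulled from shallow layers
    # Residual connection keeps the semantically aligned deep features as the base.
    return deep + details, attn

rng = np.random.default_rng(0)
num_tokens, dim = 16, 32
deep = rng.standard_normal((num_tokens, dim))
shallow_layers = [rng.standard_normal((num_tokens, dim)) for _ in range(3)]
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, attn = fuse_layers(deep, shallow_layers, w_q, w_k, w_v)
print(fused.shape)  # (16, 32): same shape as the deep features
print(attn.shape)   # (16, 48): one weight per deep token and shallow token
```

Because the output has the same shape as the deep feature map, a fuser of this kind could in principle sit between the vision encoder and the projector without changing the number of visual tokens passed to the language model.
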
### Summary

The main contribution of the paper lies in revealing that the potential of a single visual encoder in MLLMs has not been fully exploited, and in proposing a new multi-layer feature fusion method, MMFuser, which significantly improves the performance of the model in various multi-modal benchmark tests by dynamically combining shallow-layer and deep-layer features.

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

InfMLLM: A Unified Framework for Visual-Language Tasks

EVLM: An Efficient Vision-Language Model for Visual Understanding

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

Dense Connector for MLLMs

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM