Abstract: Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development.

What problem does this paper attempt to address?

The paper addresses the underuse of visual signals in existing Multimodal Large Language Models (MLLMs). Although current MLLMs have made significant progress in multimodal understanding, most research and development effort has focused on the language side, for example larger and higher-quality instruction datasets and larger language models. Visual signals have received far less attention: models typically use only the final high-level features extracted by a frozen visual encoder. To address this, the authors propose the Dense Connector, a simple, effective, plug-and-play vision-language connector. By leveraging multi-layer visual features, the Dense Connector significantly enhances existing MLLMs with minimal additional computational overhead. The model, although trained solely on images, also demonstrates zero-shot capability in video understanding. The Dense Connector is realized through the following three methods:

1. **Sparse Token Integration (STI)**: Aggregates visual tokens from several specified layers and feeds them, together with the final-layer visual tokens, into a learnable projector that maps them to the text space.
2. **Sparse Channel Integration (SCI)**: Concatenates visual tokens from several specified layers along the feature dimension, then uses a projector to map them to the text space and reduce the feature dimension.
3. **Dense Channel Integration (DCI)**: Uses visual features from all layers rather than only the specified ones, reducing redundancy and high dimensionality through grouped fusion.

Through these methods, the Dense Connector provides more visual cues, thereby enhancing the visual perception capabilities of MLLMs.
Experimental results show that the Dense Connector performs well across multiple visual encoders, image resolutions, training dataset scales, language model sizes, and MLLM architectures, achieving state-of-the-art results on 19 image and video benchmarks.
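
The three integration strategies described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the layer indices, the average pooling used to aggregate tokens in STI, the mean used for grouped fusion in DCI, the number of groups, all tensor sizes, and the single random linear map standing in for the learnable projector are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_LAYERS, NUM_TOKENS, VIS_DIM, TXT_DIM = 24, 16, 8, 32

# Stand-in for the per-layer outputs of a frozen visual encoder:
# one (NUM_TOKENS, VIS_DIM) array per layer.
layers = [rng.standard_normal((NUM_TOKENS, VIS_DIM)) for _ in range(NUM_LAYERS)]


def project(x, out_dim=TXT_DIM):
    """Stand-in for the learnable projector: one random linear map (assumption)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ w


def pool_tokens(x, stride):
    """Average-pool along the token axis (assumed aggregation for STI)."""
    n, d = x.shape
    return x[: n - n % stride].reshape(-1, stride, d).mean(axis=0 + 1)


def sti(layers, picked=(7, 15), stride=4):
    """Sparse Token Integration: add pooled tokens from selected layers."""
    extra = [pool_tokens(layers[i], stride) for i in picked]
    tokens = np.concatenate(extra + [layers[-1]], axis=0)  # token axis grows
    return project(tokens)


def sci(layers, picked=(7, 15)):
    """Sparse Channel Integration: concatenate selected layers channel-wise."""
    feats = np.concatenate([layers[i] for i in picked] + [layers[-1]], axis=1)
    return project(feats)  # projector also reduces the widened feature dim


def dci(layers, num_groups=2):
    """Dense Channel Integration: fuse all layers in groups, then concatenate."""
    groups = np.array_split(np.stack(layers), num_groups)  # split along layers
    fused = [g.mean(axis=0) for g in groups]  # one fused map per group
    feats = np.concatenate(fused + [layers[-1]], axis=1)
    return project(feats)


print(sti(layers).shape, sci(layers).shape, dci(layers).shape)
```

In this sketch STI lengthens the token sequence passed to the language model, while SCI and DCI keep the token count fixed and only widen the input to the projector. That difference is one plausible reason channel-wise integration adds little overhead.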

Dense Connector for MLLMs

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Q-MoE: Connector for MLLMs with Text-Driven Routing

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

InfMLLM: A Unified Framework for Visual-Language Tasks

Unifying Specialized Visual Encoders for Video Language Models

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Ovis: Structural Embedding Alignment for Multimodal Large Language Model

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models