Dense Connector for MLLMs

Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
2024-05-23
Abstract: Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet scant attention has been directed towards the visual signals utilized by MLLMs, which are often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector, a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B to 70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the underuse of visual signals in existing Multimodal Large Language Models (MLLMs). Although current MLLMs have made significant progress in multimodal understanding, most research effort has gone into the language side: larger and higher-quality instruction datasets and larger language models. The visual signal has received far less attention and is typically just the final high-level features extracted by a frozen visual encoder. To address this, the authors propose the Dense Connector, a simple, effective, plug-and-play vision-language connector. By leveraging multi-layer visual features, the Dense Connector significantly enhances existing MLLMs with minimal additional computational overhead. The model, although trained solely on images, also demonstrates zero-shot capability in video understanding. The Dense Connector is instantiated in three ways:
1. **Sparse Token Integration (STI)**: Aggregates visual tokens from several selected layers and feeds them, together with the final-layer visual tokens, into a learnable projector that maps them to the text space.
2. **Sparse Channel Integration (SCI)**: Concatenates visual tokens from several selected layers along the feature dimension, then uses a projector to map them to the text space and reduce the feature dimension.
3. **Dense Channel Integration (DCI)**: Uses visual features from all layers rather than only the selected ones, and fuses the layers in groups to limit redundancy and keep the feature dimension manageable.
Each of these variants supplies the language model with more visual cues than the final-layer features alone, which enhances the visual perception capabilities of MLLMs.
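The three integration strategies described above can be sketched on plain arrays. This is a minimal illustration, not the authors' implementation. The function names, the layer indices, the pooling factor used to aggregate tokens in STI, the use of a mean for grouped fusion in DCI, and the random linear map standing in for the learnable projector are all assumptions made for the example. NumPy is used only to keep the sketch self-contained.

```python
import numpy as np


def sparse_token_integration(layers, picked, pool):
    """STI sketch: shrink tokens of selected layers, then append along the token axis.

    Assumption: "aggregation" is modelled as average pooling over
    consecutive groups of `pool` tokens.
    """
    final = layers[-1]
    pooled = []
    for i in picked:
        n, d = layers[i].shape
        pooled.append(layers[i].reshape(n // pool, pool, d).mean(axis=1))
    # Extra (pooled) tokens first, final-layer tokens last.
    return np.concatenate(pooled + [final], axis=0)


def sparse_channel_integration(layers, picked):
    """SCI sketch: concatenate selected layers and the final layer along the feature axis."""
    return np.concatenate([layers[i] for i in picked] + [layers[-1]], axis=1)


def dense_channel_integration(layers, groups):
    """DCI sketch: fuse all layers in groups, then concatenate with the final layer.

    Assumption: each group is fused by an element-wise mean.
    """
    chunks = np.array_split(np.stack(layers), groups, axis=0)
    fused = [chunk.mean(axis=0) for chunk in chunks]
    return np.concatenate(fused + [layers[-1]], axis=1)


def project(tokens, text_dim, rng):
    """Hypothetical stand-in for the learnable projector: one random linear map."""
    weight = rng.standard_normal((tokens.shape[1], text_dim)) * 0.02
    return tokens @ weight


# Toy encoder output: 24 layers, 16 tokens per layer, 8 features per token.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 8)) for _ in range(24)]

sti = sparse_token_integration(layers, picked=[7, 15], pool=4)
sci = sparse_channel_integration(layers, picked=[7, 15])
dci = dense_channel_integration(layers, groups=2)

print(sti.shape, sci.shape, dci.shape)
print(project(dci, text_dim=32, rng=rng).shape)
```

With these toy sizes, STI grows the token count (24 tokens of width 8) while SCI and DCI keep 16 tokens and widen each one to 24 features. In this sketch the channel-wise variants therefore leave the number of tokens passed on to the language model unchanged and put the extra width on the projector's input side.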
Experimental results show that the Dense Connector improves performance across multiple visual encoders, image resolutions, training dataset scales, language model sizes, and MLLM architectures, achieving state-of-the-art results on 19 image and video benchmarks.