Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun
2023-10-01
Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit impressive abilities to perceive images and follow open-ended instructions. The capabilities of MLLMs depend on two crucial factors: the model architecture, which facilitates feature alignment between visual modules and large language models, and the multimodal instruction tuning datasets used for human instruction following. (i) For the model architecture, most existing models introduce an external bridge module to connect vision encoders with language models, which requires additional feature-alignment pre-training. In this work, we discover that compact pre-trained vision-language models can inherently serve as "out-of-the-box" bridges between vision and language. Based on this, we propose the Muffin framework, which directly employs pre-trained vision-language models as providers of visual signals. (ii) For the multimodal instruction tuning datasets, existing methods overlook the complementary relationship between different datasets and simply mix datasets from different tasks. Instead, we propose the UniMM-Chat dataset, which explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions. We merge information describing the same image from diverse datasets and transform it into more knowledge-intensive conversation data. Experimental results demonstrate the effectiveness of the Muffin framework and the UniMM-Chat dataset. Muffin achieves state-of-the-art performance on a wide range of vision-language tasks, significantly surpassing state-of-the-art models like LLaVA and InstructBLIP. Our model and dataset are all accessible at https://github.com/thunlp/muffin.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address two main issues:

1. **Effectiveness of Model Architecture**: Existing Multimodal Large Language Models (MLLMs) have limitations in aligning features between the visual module and the large language model. Most existing models connect the visual encoder and the language model through an external bridging module, which requires additional feature-alignment pre-training. The paper proposes a new framework, Muffin, which directly uses pre-trained vision-language models (VLMs) as an "out-of-the-box" bridge, thus avoiding the extra pre-training process.
2. **Construction of Multimodal Instruction Tuning Dataset**: Existing methods for constructing multimodal instruction tuning datasets typically mix data from different tasks together, ignoring the complementary relationships between different datasets. The paper introduces a new dataset, UniMM-Chat, which merges annotation information from different datasets to generate high-quality and diverse multimodal instructions, thereby enhancing the model's generative capability and knowledge density.

Specifically, the main contributions of the paper include:

- Proposing a new model architecture, Muffin, which connects the visual module and the large language model by directly using pre-trained VLMs as a bridge.
- Constructing a high-quality multimodal instruction tuning dataset, UniMM-Chat, containing over 1.1M instructions, by merging annotation information from multiple datasets to generate knowledge-intensive dialogue data.
- Building a benchmark, UniMM-Bench, to evaluate the comprehensive capabilities of MLLMs in reasoning and world knowledge.
- Open-sourcing Muffin, UniMM-Chat, and UniMM-Bench for community use.

These contributions aim to improve the performance of multimodal large language models across various tasks and advance research in the related field.
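
The merging step behind UniMM-Chat (collecting annotations that describe the same image from different datasets) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout (`image_id`, `annotation`), the dataset names, the sample records, and both helper functions are hypothetical, and the step that turns the merged text into a conversation is not shown because the summary above does not describe it.

```python
from collections import defaultdict


def merge_annotations_by_image(datasets):
    """Group annotations from several datasets by their shared image ID.

    `datasets` maps a dataset name to a list of records; each record is a
    dict with hypothetical keys "image_id" and "annotation".
    """
    merged = defaultdict(dict)
    for name, records in datasets.items():
        for record in records:
            per_image = merged[record["image_id"]]
            per_image.setdefault(name, []).append(record["annotation"])
    return dict(merged)


def build_prompt_context(image_id, merged):
    """Flatten all annotations of one image into a single text block."""
    lines = [f"Image {image_id}:"]
    for source in sorted(merged[image_id]):
        for annotation in merged[image_id][source]:
            lines.append(f"[{source}] {annotation}")
    return "\n".join(lines)


# Hypothetical toy records; real datasets would be loaded from disk.
datasets = {
    "captions": [
        {"image_id": "img_1", "annotation": "A dog catches a frisbee in a park."},
    ],
    "vqa": [
        {"image_id": "img_1", "annotation": "Q: What is the dog catching? A: A frisbee."},
        {"image_id": "img_2", "annotation": "Q: How many people are there? A: Two."},
    ],
}

merged = merge_annotations_by_image(datasets)
print(build_prompt_context("img_1", merged))
```

Keying on the image ID is what allows a caption and a question-answer pair about the same picture to end up in one context block, which is the complementarity that simple dataset mixing does not use.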