Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit impressive abilities to perceive images and follow open-ended instructions. The capabilities of MLLMs depend on two crucial factors: the model architecture, which facilitates feature alignment between visual modules and large language models, and the multimodal instruction tuning datasets used for human instruction following. (i) For the model architecture, most existing models introduce an external bridge module to connect vision encoders with language models, which requires additional feature-alignment pre-training. In this work, we discover that compact pre-trained vision-language models can inherently serve as "out-of-the-box" bridges between vision and language. Based on this, we propose the Muffin framework, which directly employs pre-trained vision-language models to act as providers of visual signals. (ii) For the multimodal instruction tuning datasets, existing methods overlook the complementary relationship between different datasets and simply mix datasets from different tasks. Instead, we propose the UniMM-Chat dataset, which explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions. We merge information describing the same image from diverse datasets and transform it into more knowledge-intensive conversation data. Experimental results demonstrate the effectiveness of the Muffin framework and the UniMM-Chat dataset. Muffin achieves state-of-the-art performance on a wide range of vision-language tasks, significantly surpassing state-of-the-art models like LLaVA and InstructBLIP. Our model and dataset are all accessible at https://github.com/thunlp/muffin.

What problem does this paper attempt to address?

The paper attempts to address two main issues:

1. **Effectiveness of Model Architecture**: Existing Multimodal Large Language Models (MLLMs) have limitations in aligning features between the visual module and the large language model. Most existing models connect the visual encoder and the language model through an external bridging module, which requires additional feature-alignment pre-training. The paper proposes a new framework, Muffin, which directly uses pre-trained vision-language models (VLMs) as an "out-of-the-box" bridge, thus avoiding the extra pre-training process.
2. **Construction of Multimodal Instruction Tuning Dataset**: Existing methods for constructing multimodal instruction tuning datasets typically mix data from different tasks together, ignoring the complementary relationships between different datasets. The paper introduces a new dataset, UniMM-Chat, which merges annotation information from different datasets to generate high-quality and diverse multimodal instructions, thereby enhancing the model's generative capability and knowledge density.

Specifically, the main contributions of the paper include:

- Proposing a new model architecture, Muffin, which connects the visual module and the large language model by directly using pre-trained VLMs as a bridge.
- Constructing a high-quality multimodal instruction tuning dataset, UniMM-Chat, containing over 1.1M instructions, by merging annotation information from multiple datasets to generate knowledge-intensive dialogue data.
- Building a benchmark, UniMM-Bench, to evaluate the comprehensive capabilities of MLLMs in reasoning and world knowledge.
- Open-sourcing Muffin, UniMM-Chat, and UniMM-Bench for community use.

These contributions aim to improve the performance of multimodal large language models across various tasks and advance research in the related field.
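The dataset idea summarized above, merging annotations that describe the same image across datasets before turning them into conversation data, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record shape `(image_id, text)`, the dataset names, the helper names `merge_annotations_by_image` and `build_generation_prompt`, and the prompt wording are hypothetical and are not taken from the paper or its repository. The summary does not specify how merged annotations are transformed into conversations, so the prompt step is only one plausible way to do it.

```python
from collections import defaultdict


def merge_annotations_by_image(sources):
    """Group annotations from several datasets by the image they describe.

    `sources` maps a (hypothetical) dataset name to a list of
    (image_id, annotation_text) pairs.
    """
    merged = defaultdict(dict)
    for dataset_name, records in sources.items():
        for image_id, text in records:
            # Collect every annotation for this image, keyed by source dataset.
            merged[image_id].setdefault(dataset_name, []).append(text)
    return dict(merged)


def build_generation_prompt(image_id, annotations):
    """Flatten the merged annotations of one image into a single text prompt.

    Assumption: a text generator would turn this prompt into dialogue data.
    """
    lines = [f"Image {image_id} is described by the following annotations:"]
    for dataset_name in sorted(annotations):
        for text in annotations[dataset_name]:
            lines.append(f"- [{dataset_name}] {text}")
    lines.append("Write a multi-turn conversation about this image.")
    return "\n".join(lines)


# Toy data with made-up image IDs and annotations.
example_sources = {
    "captions": [("img_1", "A dog runs on a beach."), ("img_2", "A red bus.")],
    "vqa": [("img_1", "Q: What animal is shown? A: A dog.")],
}

merged = merge_annotations_by_image(example_sources)
prompt = build_generation_prompt("img_1", merged["img_1"])
print(prompt)
```

In this sketch, `img_1` ends up with both a caption and a question-answer pair, so its prompt carries more information than either source alone; `img_2` appears in only one dataset and keeps just its caption.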

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
