Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.

What problem does this paper attempt to address?

This paper proposes mPLUG-Owl2, a multimodal large language model (MLLM) that aims to improve performance on both text and multimodal tasks through effective modality collaboration. It addresses the limitations of existing MLLMs with the following key techniques:

1. **Modular Network Design**: The model adopts a modular network structure in which the language decoder serves as a universal interface for signals from different modalities. This design lets the model manage different types of input effectively.
2. **Modality Adaptive Module (MAM)**: A modality adaptive module facilitates collaboration between visual and textual features while preserving the characteristics of each modality. This mechanism reduces interference between modalities and enhances cross-modal information interaction.
3. **Two-Stage Training Paradigm**: Training proceeds in two stages, pre-training followed by joint instruction fine-tuning. This approach enables the visual encoder to learn visual information from low level to high level over the course of training.
4. **Visual Abstractor**: To offset the cost of increased image resolution, the model includes a visual abstractor that extracts higher-level semantic features and significantly reduces computational complexity.

Experimental results show that mPLUG-Owl2 achieves state-of-the-art performance on multiple benchmarks, including image captioning and visual question answering. The model also performs strongly in zero-shot multimodal evaluations, reaching leading levels on various multimodal benchmarks. Even on pure-text tasks, mPLUG-Owl2 performs well, demonstrating its capabilities in natural language understanding and generation.

In summary, mPLUG-Owl2 addresses the modality collaboration challenges in multimodal large language models through its design and training techniques, providing new insights for the development of future multimodal foundation models.
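
To make items 2 and 4 above more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that the visual abstractor is a cross-attention step in which a small fixed set of query vectors summarizes many patch features, and that the modality adaptive module uses per-modality normalization and per-modality key/value projections with a shared query projection and shared attention. All function names, the parameter layout, the single attention head, and the random weights are hypothetical choices made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, gain, eps=1e-5):
    """Normalize one token vector, then scale it by a per-modality gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) for v in x]

def visual_abstractor(patch_features, queries):
    """Assumed design: each query cross-attends over all patch features,
    so the output has len(queries) tokens however many patches come in."""
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, p) / math.sqrt(d) for p in patch_features])
        out.append([sum(w * p[c] for w, p in zip(weights, patch_features))
                    for c in range(d)])
    return out

def modality_adaptive_attention(tokens, modality, params):
    """tokens: list of d-dim vectors; modality: 0 (visual) or 1 (text) per token."""
    d = len(tokens[0])
    # Per-modality normalization keeps each modality's statistics separate.
    normed = [layer_norm(x, params["gain"][m]) for x, m in zip(tokens, modality)]
    # Queries share one projection; keys and values use per-modality projections.
    q = [matvec(params["W_q"], x) for x in normed]
    k = [matvec(params["W_k"][m], x) for x, m in zip(normed, modality)]
    v = [matvec(params["W_v"][m], x) for x, m in zip(normed, modality)]
    out = []
    for qi in q:
        # Shared attention: every token attends over tokens of both modalities.
        weights = softmax([dot(qi, kj) / math.sqrt(d) for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out

def random_matrix(rng, d):
    return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]

def make_params(d, seed=0):
    """Hypothetical parameter layout; index 0 is visual, index 1 is text."""
    rng = random.Random(seed)
    return {
        "gain": [1.0, 1.0],
        "W_q": random_matrix(rng, d),
        "W_k": [random_matrix(rng, d), random_matrix(rng, d)],
        "W_v": [random_matrix(rng, d), random_matrix(rng, d)],
    }

rng = random.Random(0)
d = 4
patches = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(16)]
queries = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(2)]
visual = visual_abstractor(patches, queries)  # 16 patch features -> 2 visual tokens
text = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
fused = modality_adaptive_attention(visual + text, [0, 0, 1, 1, 1], make_params(d))
print(len(visual), len(fused), len(fused[0]))  # 2 5 4
```

In this sketch the abstractor is what shrinks the sequence the decoder has to process, while the per-modality projections are the only place where visual and text tokens are treated differently; the attention itself is shared across both.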

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

InfMLLM: A Unified Framework for Visual-Language Tasks

Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI

Model Composition for Multimodal Large Language Models

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

PILL: Plug Into LLM with Adapter Expert and Attention Gate

A Survey on Multimodal Large Language Models

ModaVerse: Efficiently Transforming Modalities with LLMs

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

LLMs Can Evolve Continually on Modality for X-Modal Reasoning

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception