mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
2023-11-09
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main goal of this paper is to propose a new multimodal large language model (MLLM), named mPLUG-Owl2, which aims to improve the performance of text and multimodal tasks through effective modality collaboration. Specifically, mPLUG-Owl2 addresses the issues present in existing MLLMs through the following key techniques and methods:
1. **Modular Network Design**: The model adopts a modular network structure, where the language decoder serves as a universal interface for handling different modality signals. This design allows the model to effectively manage different types of inputs.
2. **Modality-Adaptive Module (MAM)**: A modality-adaptive module is introduced to facilitate collaboration between visual and textual features while preserving their respective modality characteristics. This mechanism can reduce interference between modalities and enhance cross-modal information interaction.
3. **Two-Stage Training Paradigm**: The researchers proposed a training process that includes two stages: pre-training and joint instruction fine-tuning. This approach enables the visual encoder to learn visual information from low-level to high-level throughout the training process.
4. **Visual Abstractor**: To address the issues brought by increased image resolution, the model also includes a visual abstractor that can extract higher-level semantic features and significantly reduce computational complexity.
Experimental results show that mPLUG-Owl2 achieves state-of-the-art performance on multiple benchmarks, including image captioning and visual question answering tasks. Additionally, the model performs excellently in zero-shot multimodal evaluations, reaching leading levels on various multimodal benchmarks. Furthermore, even in pure text tasks, mPLUG-Owl2 demonstrates outstanding performance, proving its capabilities in natural language understanding and generation.
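To make the modality-adaptive module more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one plausible reading of the description above: the query projection is shared across modalities (the "collaboration" part), while layer-norm gains and key/value projections are kept separate per modality (the "preserving modality characteristics" part). The class name `ModalityAdaptiveAttention`, the helper functions, the single attention head, and the absence of causal masking are all simplifications chosen for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(row, gain, eps=1e-5):
    """Normalize one token vector, then apply a per-feature gain."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [g * (x - mean) / math.sqrt(var + eps) for x, g in zip(row, gain)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ModalityAdaptiveAttention:
    """Hypothetical single-head attention with per-modality norm and key/value weights."""

    MODALITIES = ("text", "image")

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Shared across modalities (assumption): the query projection.
        self.w_q = random_matrix(dim, dim, rng)
        # Separate per modality (assumption): norm gains, key and value projections.
        self.gain = {m: [1.0] * dim for m in self.MODALITIES}
        self.w_k = {m: random_matrix(dim, dim, rng) for m in self.MODALITIES}
        self.w_v = {m: random_matrix(dim, dim, rng) for m in self.MODALITIES}

    def __call__(self, tokens, modalities):
        # Each token is normalized with the gain of its own modality.
        normed = [layer_norm(t, self.gain[m]) for t, m in zip(tokens, modalities)]
        q = matmul(normed, self.w_q)
        # Keys and values are routed through the projection of the token's modality.
        k = [matmul([x], self.w_k[m])[0] for x, m in zip(normed, modalities)]
        v = [matmul([x], self.w_v[m])[0] for x, m in zip(normed, modalities)]
        scale = math.sqrt(self.dim)
        out = []
        for qi in q:
            # Every token attends over all tokens, regardless of modality.
            weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
            out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(self.dim)])
        return out

# Usage: two image tokens followed by three text tokens, each of width 8.
rng = random.Random(1)
tokens = random_matrix(5, 8, rng)
modalities = ["image", "image", "text", "text", "text"]
attn = ModalityAdaptiveAttention(dim=8)
mixed = attn(tokens, modalities)
all_text = attn(tokens, ["text"] * 5)
```

Running the same tokens with different modality tags (`mixed` versus `all_text`) produces different outputs, because the image tokens pass through their own key and value weights. That routing is the only modality-specific step in this sketch; the attention itself still mixes all tokens in one sequence.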
In summary, mPLUG-Owl2 addresses the modality collaboration challenge in multimodal large language models through its modular design, modality-adaptive module, two-stage training, and visual abstractor, providing new insights for the development of future multimodal foundation models.