Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the generation abilities of the LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module, with the visual knowledge module frozen. We carefully build a visually related instruction evaluation set, OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at <a class="link-external link-https" href="https://github.com/X-PLUG/mPLUG-Owl" rel="external noopener nofollow">this https URL</a>.
The online demo is available at <a class="link-external link-https" href="https://www.modelscope.cn/studios/damo/mPLUG-Owl" rel="external noopener nofollow">this https URL</a>.
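
The two-stage schedule described in the abstract can be summarized as a table of which modules receive gradient updates in each stage. The following is a minimal sketch of that table only, not the authors' training code: the names `TrainableModules` and `stage_plan` are hypothetical, and treating the base LLM weights as frozen in stage two is an assumption that follows from how LoRA is normally applied.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainableModules:
    """Hypothetical record: True means the module is updated in that stage."""
    visual_knowledge: bool   # visual knowledge module
    visual_abstractor: bool  # visual abstractor module
    llm_base: bool           # base LLM weights
    llm_lora: bool           # LoRA module attached to the LLM


def stage_plan(stage: int) -> TrainableModules:
    """Return which modules are trained in the given stage (1 or 2)."""
    if stage == 1:
        # Stage 1: train the visual knowledge and abstractor modules
        # against a frozen LLM to align image and text.
        return TrainableModules(True, True, False, False)
    if stage == 2:
        # Stage 2: freeze the visual knowledge module; jointly fine-tune
        # the LoRA module and the abstractor. Base LLM weights are assumed
        # to stay frozen, as is usual with LoRA.
        return TrainableModules(False, True, False, True)
    raise ValueError(f"unknown stage: {stage}")


def trainable_names(stage: int) -> list[str]:
    """List the names of the modules updated in the given stage."""
    plan = stage_plan(stage)
    return [f.name for f in fields(plan) if getattr(plan, f.name)]


if __name__ == "__main__":
    for s in (1, 2):
        print(f"stage {s}: {trainable_names(s)}")
```

In a real training loop, each flag would typically be applied by toggling gradient tracking on the corresponding module's parameters before building the optimizer.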

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

InfMLLM: A Unified Framework for Visual-Language Tasks

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Unified Vision-Language Pre-Training for Image Captioning and VQA

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Towards Better Vision-Inspired Vision-Language Models

Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

CogVLM: Visual Expert for Pretrained Language Models

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

EVLM: An Efficient Vision-Language Model for Visual Understanding

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages