Abstract: Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the LLM's generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module, while the visual knowledge module is kept frozen. We carefully build a visually related instruction evaluation set, OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at <a class="link-external link-https" href="https://github.com/X-PLUG/mPLUG-Owl" rel="external noopener nofollow">this https URL</a>.
The online demo is available at <a class="link-external link-https" href="https://www.modelscope.cn/studios/damo/mPLUG-Owl" rel="external noopener nofollow">this https URL</a>.
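
The two-stage schedule described in the abstract can be read as a table of which modules are updated and which are frozen in each stage. The following is a minimal sketch of that reading, not the authors' implementation: the module names (`visual_knowledge`, `visual_abstractor`, `llm`, `llm_lora`) and the helper `trainable_modules` are hypothetical, and the sketch assumes that the base LLM weights stay frozen in the second stage, as is usual when only a LoRA module is tuned.

```python
from typing import Dict

# Hypothetical module names for illustration; they are not taken from the
# mPLUG-Owl codebase.
MODULES = ("visual_knowledge", "visual_abstractor", "llm", "llm_lora")


def trainable_modules(stage: int) -> Dict[str, bool]:
    """Map each module name to True (updated) or False (frozen) for a stage."""
    if stage == 1:
        # Stage 1: the visual knowledge module and the abstractor are trained
        # against a frozen LLM to align image and text.
        trainable = {"visual_knowledge", "visual_abstractor"}
    elif stage == 2:
        # Stage 2: the visual knowledge module is frozen; a LoRA module on the
        # LLM and the abstractor are fine-tuned jointly.
        # Assumption: the base LLM weights remain frozen under LoRA.
        trainable = {"visual_abstractor", "llm_lora"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {name: name in trainable for name in MODULES}


if __name__ == "__main__":
    for stage in (1, 2):
        print(f"stage {stage}: {trainable_modules(stage)}")
```

In a real training loop, a mapping like this would typically drive whether each module's parameters receive gradient updates; the abstractor is the only component trained in both stages.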
