Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but there is limited research focusing on their ability to generate data by converting unlabeled images into visual instruction tuning data. To this end, this paper is the first to explore the potential of empowering MLLMs to generate data rather than prompting GPT-4. We introduce Genixer, a holistic data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at

What problem does this paper attempt to address?

The paper addresses the limited research on whether Multimodal Large Language Models (MLLMs) can themselves generate visual instruction tuning data. Although current MLLMs demonstrate excellent problem-solving capabilities, little work has examined their ability to convert unlabeled images into visual instruction tuning data. The paper is the first to explore empowering MLLMs to generate this data instead of prompting GPT-4.

The paper proposes Genixer, a holistic data generation pipeline with four key steps:

1. **Instruction Data Collection**: Collect existing vision-language task data as a source for generating task-specific data.
2. **Instruction Template Design**: Design two-level instruction templates to make data generation controllable.
3. **Empowering MLLMs**: Select two representative MLLMs (LLaVA1.5 and Shikra) and train them to generate instruction data.
4. **Data Generation and Filtering**: Generate data and use an automatic data-filtering pipeline to remove incorrect samples.

The paper also describes two data generation modes: task-agnostic and task-specific.

Through experiments and synthetic data analysis, the paper reports the following main findings:

1. Current MLLMs can serve as robust data generators without assistance from GPT-4V.
2. MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
3. Synthetic datasets improve MLLM performance across various multimodal benchmarks and help mitigate model hallucinations.

In summary, the main contributions of the paper are:

- Proposing Genixer, a holistic data generation pipeline that can generate diverse visual instruction tuning data from unlabeled images.
- Contributing two open-source data generation models, Genixer L and Genixer S, to promote data creation in the multimodal field.
- Contributing two high-quality multimodal datasets, Genixer-915K and Genixer-350K, for improving the performance of other MLLMs on multiple benchmarks.
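
To make the template-design and filtering steps more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the prompt wording, the `Sample` type, the `scorer` callback, and the `0.5` threshold are all hypothetical. The sketch shows how a two-level template could switch between task-agnostic and task-specific generation, and how generated samples might be filtered by a pluggable quality score.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical prompt wording; the paper's actual templates may differ.
GENERIC_PROMPT = "Generate a question and its answer for the given image."


def build_instruction(task_type: Optional[str] = None) -> str:
    """Two-level template: a generic request, optionally narrowed to a task.

    task_type=None   -> task-agnostic mode (the generator picks the task).
    task_type="..."  -> task-specific mode (the caller controls the task).
    """
    if task_type is None:
        return GENERIC_PROMPT
    return f"{GENERIC_PROMPT} This is a {task_type} task."


@dataclass
class Sample:
    """One generated instruction-tuning example (illustrative schema)."""
    image_id: str
    question: str
    answer: str


def filter_samples(
    samples: Iterable[Sample],
    scorer: Callable[[Sample], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[Sample]:
    """Keep non-empty samples whose quality score reaches the threshold."""
    kept = []
    for sample in samples:
        # Drop structurally broken generations before scoring them.
        if not sample.question.strip() or not sample.answer.strip():
            continue
        if scorer(sample) >= threshold:
            kept.append(sample)
    return kept
```

In practice the `scorer` would wrap whatever model or heuristic judges whether a generated question-answer pair is consistent with its image; keeping it as a callback leaves that choice open.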

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
