Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou
2024-05-19
Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but there is limited research focusing on their ability to generate data by converting unlabeled images into visual instruction tuning data. To this end, this paper is the first to explore the potential of empowering MLLM to generate data rather than prompting GPT-4. We introduce Genixer, a holistic data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limited research on using Multimodal Large Language Models (MLLMs) to generate visual instruction data. Although current MLLMs have demonstrated excellent problem-solving capabilities, little work has examined whether they can convert unlabeled images into visual instruction tuning data. The paper is the first to explore using MLLMs themselves as data generators instead of prompting GPT-4. It proposes a comprehensive data generation pipeline named Genixer, which consists of four key steps:

1. **Instruction Data Collection**: Collect existing vision-language task data as a source for generating task-specific data.
2. **Instruction Template Design**: Design two-level instruction templates to make data generation controllable.
3. **Empowering MLLMs**: Select two representative MLLMs (LLaVA1.5 and Shikra) and train them to generate instruction data.
4. **Data Generation and Filtering**: Generate data and use an automatic filtering pipeline to remove incorrect samples.

The paper also outlines two data generation modes: task-agnostic and task-specific. Through experiments and synthetic data analysis, it reaches the following main findings:

1. Current MLLMs can serve as robust generators of visual instruction data without assistance from GPT-4V.
2. MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
3. The synthetic datasets improve MLLM performance on multimodal benchmarks (10 of 12 benchmarks for LLaVA1.5 with a VQA-like synthetic dataset, and 7 of 8 REC datasets for Shikra with a REC-like synthetic dataset) and help mitigate model hallucinations.

In summary, the main contributions of the paper include:

- Proposing a comprehensive data generation pipeline, Genixer, that can generate diverse visual instruction tuning data from unlabeled images.
- Contributing two open-source data generation models, Genixer L and Genixer S, to promote data creation in the multimodal field.
- Contributing two high-quality multimodal datasets, Genixer-915K and Genixer-350K, for improving the performance of other MLLMs on multiple benchmarks.
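
To make the four pipeline steps and the two generation modes more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the template strings, the `Sample` type, `build_prompt`, `generate_and_filter`, the stub generator, and the non-empty-answer filter are all hypothetical stand-ins. In the real pipeline the generator would be a trained MLLM such as LLaVA1.5 or Shikra, and the filter would be the paper's automatic filtering step, whose details this summary does not give.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical template strings; the paper's actual wording is not given here.
GENERAL_INSTRUCTION = "Generate a question and an answer for this image."
TASK_INSTRUCTION = "This is a {task} task."


@dataclass
class Sample:
    image_id: str
    question: str
    answer: str


def build_prompt(task: Optional[str] = None) -> str:
    """Two-level template: a general instruction plus an optional task-level line.

    task=None corresponds to the task-agnostic mode; passing a task name
    corresponds to the task-specific mode.
    """
    if task is None:
        return GENERAL_INSTRUCTION
    return f"{GENERAL_INSTRUCTION} {TASK_INSTRUCTION.format(task=task)}"


def generate_and_filter(
    image_ids: Iterable[str],
    generator: Callable[[str, str], Sample],
    keep: Callable[[Sample], bool],
    task: Optional[str] = None,
) -> List[Sample]:
    """Run the generator on each unlabeled image, then drop rejected samples."""
    prompt = build_prompt(task)
    candidates = [generator(image_id, prompt) for image_id in image_ids]
    return [sample for sample in candidates if keep(sample)]


def stub_generator(image_id: str, prompt: str) -> Sample:
    # Stand-in for a trained MLLM. It returns an empty answer for one image
    # so that the filter has something to reject.
    answer = "" if image_id == "img_2" else "a dog"
    return Sample(image_id, "What animal is shown?", answer)


def non_empty_answer(sample: Sample) -> bool:
    # Placeholder filter; a real one would check correctness, not just presence.
    return bool(sample.answer.strip())


kept = generate_and_filter(
    ["img_1", "img_2", "img_3"], stub_generator, non_empty_answer, task="VQA"
)
print([s.image_id for s in kept])  # ['img_1', 'img_3']
```

Keeping the generator and the filter as plain callables means the same loop covers both a VQA-like and a REC-like setup: only the task name and the two functions passed in would change.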