MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai

2024-08-07

Abstract: Despite the effectiveness of vision-language supervised fine-tuning in enhancing the performance of Vision Large Language Models (VLLMs), existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may impair the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct a high-quality, diverse visual instruction tuning dataset, MMInstruct, which consists of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation studies, we demonstrate that MMInstruct can significantly improve the performance of VLLMs, e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data are to be made available at https://github.com/yuecao0119/MMInstruct.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address two key limitations of existing visual instruction tuning datasets:

1. **Instruction Annotation Quality**: Although existing Vision Large Language Models (VLLMs) exhibit strong performance, the instructions generated by these advanced models may still suffer from inaccuracies, such as hallucinations.
2. **Instruction and Image Diversity**: The limited range of instruction types and the lack of diversity in image data in existing datasets may impair the model's ability to generate diverse outputs that are close to real-world scenarios.

To address these issues, the authors construct a high-quality and diverse visual instruction tuning dataset called MMInstruct, which contains 973K instructions from 24 domains. MMInstruct includes four types of instructions: judgement (true/false) questions, multiple-choice questions, long visual question answering, and short visual question answering. To build MMInstruct, the authors propose a semi-automatic, low-cost instruction generation engine based on GPT-4V, GPT-3.5, and manual correction, which they report costs 1/6 as much as manual construction. Experimental validation and ablation studies show that MMInstruct can significantly improve the performance of VLLMs: a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks.
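As a rough illustration of the four instruction types described above, the sketch below models a single instruction record and tallies records by type. It is only a sketch under stated assumptions: the class names, field names, and sample records are hypothetical and are not taken from the MMInstruct release, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class InstructionType(Enum):
    """The four instruction types named in the abstract."""
    JUDGEMENT = "judgement"
    MULTIPLE_CHOICE = "multiple_choice"
    LONG_VQA = "long_vqa"
    SHORT_VQA = "short_vqa"


@dataclass(frozen=True)
class InstructionRecord:
    # All field names here are hypothetical, chosen for illustration only.
    image_path: str
    domain: str
    instruction_type: InstructionType
    question: str
    answer: str


def count_by_type(records):
    """Return a Counter mapping each instruction type to its record count."""
    return Counter(r.instruction_type for r in records)


# Made-up sample records, not drawn from the dataset.
samples = [
    InstructionRecord("img_0001.jpg", "poster", InstructionType.JUDGEMENT,
                      "Is there a date printed on the poster?", "Yes"),
    InstructionRecord("img_0002.jpg", "chart", InstructionType.MULTIPLE_CHOICE,
                      "Which bar is tallest? A. Red B. Blue C. Green", "B"),
    InstructionRecord("img_0003.jpg", "street", InstructionType.SHORT_VQA,
                      "What color is the car?", "White"),
]

counts = count_by_type(samples)
for kind in InstructionType:
    print(f"{kind.value}: {counts[kind]}")
```

Keeping the type as an enum rather than a free-form string makes it easy to check that a mixed training set covers all four formats before fine-tuning.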
