MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li

2024-06-28

Abstract: This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the limited task coverage of existing visual instruction datasets. Because these datasets focus mainly on question answering, models trained on them generalize poorly to broader scenarios such as creative writing, summarization, or image analysis. To close this gap, the paper proposes MM-Instruct, a large-scale, high-quality visual instruction dataset, together with a method for constructing it: the strong instruction-following capabilities of existing large language models (LLMs) are used to generate novel visual instruction data from large-scale but conventional image captioning datasets. The goal is to improve the instruction-following capabilities of large multimodal models (LMMs) so that they can handle a wider range of real-world tasks.
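The three stages named in the abstract (expanding seed instructions, matching instructions to images, and generating caption-grounded answers) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`TextModel`, `InstructionSample`, `expand_instructions`, `match_instruction`, `build_sample`) is hypothetical, the prompts are placeholders, and the word-overlap matching is only a stand-in, since the summary does not say how instructions are matched to images. The text model is passed in as a plain callable so the sketch runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to generated text.
TextModel = Callable[[str], str]


@dataclass
class InstructionSample:
    image_id: str
    instruction: str
    answer: str


def expand_instructions(seeds: List[str], model: TextModel, rounds: int = 1) -> List[str]:
    """Stage 1: grow a seed set by asking a text model for new instructions."""
    pool = list(seeds)
    for _ in range(rounds):
        prompt = "Write new instructions in the style of:\n" + "\n".join(pool)
        for line in model(prompt).splitlines():
            line = line.strip()
            # Keep non-empty lines that are not already in the pool.
            if line and line not in pool:
                pool.append(line)
    return pool


def match_instruction(caption: str, instructions: List[str]) -> str:
    """Stage 2: pick the instruction sharing the most words with the caption.

    Word overlap is an assumption made for illustration only.
    """
    caption_words = set(caption.lower().split())
    return max(instructions, key=lambda ins: len(caption_words & set(ins.lower().split())))


def build_sample(image_id: str, caption: str, instructions: List[str],
                 model: TextModel) -> InstructionSample:
    """Stage 3: answer the matched instruction, grounded on the caption text."""
    instruction = match_instruction(caption, instructions)
    prompt = f"Image description: {caption}\nInstruction: {instruction}\nAnswer:"
    return InstructionSample(image_id, instruction, model(prompt).strip())
```

In this sketch the image itself never reaches the text model; only its caption does, which mirrors the abstract's statement that answer generation is grounded by detailed text descriptions of the images.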

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

VIGC: Visual Instruction Generation and Correction

Aligning Large Multi-Modal Model with Robust Instruction Tuning

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Instruction-Guided Visual Masking

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Align^2LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Video Instruction Tuning With Synthetic Data

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts