Abstract: Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucination, producing descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning in the manner of human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.
