Abstract: The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data; this requires pre-annotated image captions and detection bounding boxes and struggles to capture image details. A practical alternative is to use available MLLMs to generate instruction data for vision-language tasks. However, currently accessible MLLMs are less capable than their LLM counterparts: they tend to produce inadequate responses and generate false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables MLLMs to generate instruction-tuning data and progressively enhance its quality on the fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct inaccuracies in the data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we fine-tune mainstream models and validate data quality through various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods but also effectively enhances benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC.
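
The generate-then-correct loop described in the abstract can be illustrated with a minimal sketch. The abstract only states that VIC uses "an iterative update mechanism"; the sentence-by-sentence scheme below (accept one sentence, then regenerate the remainder conditioned on the accepted prefix) is an assumption made for illustration. The names `vigc`, `generate`, `correct`, and `split_sentences` are hypothetical and do not come from the released code.

```python
from typing import Callable, List, Tuple

# Hypothetical model interfaces (assumptions, not the released VIGC API):
#   generate(image) -> (question, draft_answer)            # VIG step
#   correct(image, question, accepted_prefix) -> answer    # VIC step
GenerateFn = Callable[[str], Tuple[str, str]]
CorrectFn = Callable[[str, str, str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', adequate for this sketch only."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def vigc(image: str, generate: GenerateFn, correct: CorrectFn,
         max_iters: int = 8) -> Tuple[str, str]:
    """Generate a question/answer pair, then refine the answer iteratively."""
    question, answer = generate(image)      # VIG: draft instruction data
    kept: List[str] = []                    # sentences accepted so far
    for _ in range(max_iters):
        sentences = split_sentences(answer)
        if len(sentences) <= len(kept):     # nothing new was produced
            break
        kept.append(sentences[len(kept)])   # accept the next sentence
        # VIC (assumed form): regenerate the answer given the accepted prefix,
        # so later sentences are re-derived rather than trusted from the draft.
        answer = correct(image, question, " ".join(kept))
    return question, " ".join(kept)
```

In this sketch, later sentences of the draft are never copied directly; each is kept only after the corrector has reproduced it from the accepted prefix. This is one plausible reading of how iterative correction could limit hallucinated content.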

Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

Continual Instruction Tuning for Large Multimodal Models

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

Multimodal Instruction Tuning with Conditional Mixture of LoRA

Learning without Forgetting for Vision-Language Models

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

LLaCA: Multimodal Large Language Continual Assistant

Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

Dual Low-Rank Adaptation for Continual Learning with Pre-Trained Models

MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation

CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model

Aligning Large Multi-Modal Model with Robust Instruction Tuning

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

SwitchCIT: Switching for Continual Instruction Tuning of Large Language Models

Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer

VIGC: Visual Instruction Generation and Correction

Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

Vision-Language Instruction Tuning: A Review and Analysis