Abstract: The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data; this requires pre-annotated image captions and detection bounding boxes and struggles to capture image details. A practical alternative is to use available MLLMs to generate instruction data for vision-language tasks. However, currently accessible MLLMs are not as powerful as their LLM counterparts: they tend to produce inadequate responses and generate false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables MLLMs to generate instruction-tuning data and progressively enhance its quality on the fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality through various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods but also effectively enhances benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC.
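
The generate-then-correct idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `generate_and_correct`, the callables `generate` and `correct`, the `max_rounds` parameter, and the toy stand-in models are all hypothetical, and the stopping rule (stop when the corrector returns the answer unchanged) is an assumption made for illustration only.

```python
from typing import Callable, Tuple


def generate_and_correct(
    image_id: str,
    generate: Callable[[str], Tuple[str, str]],
    correct: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, str, int]:
    """Hypothetical generate-then-correct loop (illustrative only).

    generate(image_id) -> (question, draft_answer)        # stands in for VIG
    correct(image_id, question, answer) -> revised_answer  # stands in for VIC
    Returns (question, final_answer, number_of_revisions_applied).
    """
    question, answer = generate(image_id)
    revisions = 0
    for _ in range(max_rounds):
        revised = correct(image_id, question, answer)
        if revised == answer:
            # Assumed stopping rule: the corrector has nothing left to change.
            break
        answer = revised
        revisions += 1
    return question, answer, revisions


# Toy stand-ins so the sketch runs without any model; both are assumptions.
def toy_generate(image_id: str) -> Tuple[str, str]:
    return "What is in the image?", "A dog and a red frisbee on grass."


def toy_correct(image_id: str, question: str, answer: str) -> str:
    # Pretend an image-grounded check finds no frisbee and removes the claim.
    return answer.replace(" and a red frisbee", "")


if __name__ == "__main__":
    q, a, n = generate_and_correct("img_001", toy_generate, toy_correct)
    print(q, "|", a, "|", n)
```

In a real pipeline, `generate` and `correct` would wrap calls to a vision-language model; here they are plain functions so that the control flow of iterative correction is visible in isolation.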

Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring

An Intelligent Industrial Visual Monitoring and Maintenance Framework Empowered by Large-Scale Visual and Language Models

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

VLLaVO: Mitigating Visual Gap through LLMs

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings

Enhancing Advanced Visual Reasoning Ability of Large Language Models

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection

InfMLLM: A Unified Framework for Visual-Language Tasks

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

VSLLaVA: a pipeline of large multimodal foundation model for industrial vibration signal analysis

CogVLM2: Visual Language Models for Image and Video Understanding

On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study

CogVLM: Visual Expert for Large Language Models

OphGLM: An ophthalmology large language-and-vision assistant

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

VIGC: Visual Instruction Generation and Correction