StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei

2023-12-28

Abstract: The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. Additionally, datasets can be arbitrarily scaled. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in more than ten commonly assessed capabilities. Additionally, our model achieves state-of-the-art results across multiple widely recognized multimodal benchmarks.

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper attempts to address the issues in visual instruction tuning of multimodal large language models (LLMs), such as domain bias, limited data volume, and restricted types of generated dialogues in existing datasets. Specifically:

1. **Domain Bias**: Existing large-scale visual-text datasets (e.g., LAION and CC) usually contain noise and exhibit domain bias in image styles. For example, the COCO dataset mainly includes images from daily life, while stylized images like cartoons are rarely seen.
2. **Limited Data Volume**: Current methods rely on annotations from benchmark datasets to construct image-dialogue datasets, which limits the diversity and quantity of data, making it difficult to meet the needs of large-scale training.
3. **Restricted Types of Generated Dialogues**: Existing visual annotations may limit the types of generated dialogues. For instance, current datasets do not directly enhance the model's ability to understand jokes in images.

To address these issues, the paper proposes a new data collection method that enhances visual instruction tuning by synchronously generating images and dialogues through generative models. This method leverages the powerful capabilities of generative models, combining ChatGPT and text-to-image generation models to create diverse and controllable image-dialogue datasets. These datasets not only provide greater flexibility but can also be expanded arbitrarily, significantly improving the model's performance on multiple common evaluation capabilities and achieving state-of-the-art results in several widely recognized multimodal benchmarks.
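
The data collection idea described above (a language model writes an image prompt together with a dialogue about that image, and a text-to-image model renders the prompt) could be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: `synthesize_dataset`, `ImageDialogueSample`, `fake_language_model`, and `fake_image_model` are hypothetical names, and the two `fake_*` functions are offline stand-ins for ChatGPT and a text-to-image model so that the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]  # (question, answer) turns about one image


@dataclass
class ImageDialogueSample:
    image_prompt: str   # text that was used to render the image
    image: bytes        # rendered image data
    dialogue: Dialogue  # conversation grounded in that same prompt


def synthesize_dataset(
    capabilities: List[str],
    samples_per_capability: int,
    write_prompt_and_dialogue: Callable[[str, int], Tuple[str, Dialogue]],
    render_image: Callable[[str], bytes],
) -> List[ImageDialogueSample]:
    """Build image-dialogue pairs where both halves come from one prompt."""
    dataset = []
    for capability in capabilities:
        for index in range(samples_per_capability):
            # One call yields both the image prompt and its dialogue,
            # so the two stay consistent with each other.
            prompt, dialogue = write_prompt_and_dialogue(capability, index)
            image = render_image(prompt)
            dataset.append(ImageDialogueSample(prompt, image, dialogue))
    return dataset


# Hypothetical offline stand-in for the language model (assumption).
def fake_language_model(capability: str, index: int) -> Tuple[str, Dialogue]:
    prompt = f"a cartoon illustrating {capability}, variant {index}"
    question = f"What in this image relates to {capability}?"
    answer = f"The image shows {prompt}."
    return prompt, [(question, answer)]


# Hypothetical offline stand-in for the text-to-image model (assumption):
# it returns the prompt bytes instead of real pixels.
def fake_image_model(prompt: str) -> bytes:
    return prompt.encode("utf-8")


dataset = synthesize_dataset(
    ["joke understanding", "counting"], 2, fake_language_model, fake_image_model
)
print(len(dataset), "samples;", "first prompt:", dataset[0].image_prompt)
```

Because the capability list and the per-capability sample count are plain arguments, the sketch reflects the two properties the summary emphasizes: the content is controllable (choose which capabilities to target, such as joke understanding) and the dataset size can be scaled by raising the count.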
