StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Yanda Li,Chi Zhang,Gang Yu,Zhibin Wang,Bin Fu,Guosheng Lin,Chunhua Shen,Ling Chen,Yunchao Wei
2023-12-28
Abstract:The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. Additionally, datasets can be arbitrarily scaled. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in more than ten commonly assessed capabilities. Additionally, our model achieves state-of-the-art results across multiple widely recognized multimodal benchmarks.
Computer Vision and Pattern Recognition,Computation and Language,Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the issues in visual instruction tuning of multimodal large language models (LLMs), such as domain bias, limited data volume, and restricted types of generated dialogues in existing datasets. Specifically: 1. **Domain Bias**: Existing large-scale visual-text datasets (e.g., LAION and CC) usually contain noise and exhibit domain bias in image styles. For example, the COCO dataset mainly includes images from daily life, while stylized images like cartoons are rarely seen. 2. **Limited Data Volume**: Current methods rely on annotations from benchmark datasets to construct image-dialogue datasets, which limits the diversity and quantity of data, making it difficult to meet the needs of large-scale training. 3. **Restricted Types of Generated Dialogues**: Existing visual annotations may limit the types of generated dialogues. For instance, current datasets do not directly enhance the model's ability to understand jokes in images. To address these issues, the paper proposes a new data collection method that enhances visual instruction tuning by synchronously generating images and dialogues through generative models. This method leverages the powerful capabilities of generative models, combining ChatGPT and text-to-image generation models to create diverse and controllable image-dialogue datasets. These datasets not only provide greater flexibility but can also be expanded arbitrarily, significantly improving the model's performance on multiple common evaluation capabilities and achieving state-of-the-art results in several widely recognized multimodal benchmarks.