ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

2024-06-17

Abstract: Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset, and experimental results demonstrate the effectiveness of the proposed scheme: the models achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA, and open-source it to the research community for developing better resource-efficient LVLMs for wider usage.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to close the performance gap between lightweight (lite) vision-language models and traditional-scale large vision-language models (LVLMs). Specifically:

1. **Resource Consumption Issue**: LVLMs exhibit strong reasoning and generalization capabilities across a broad range of vision-language tasks, but they require substantial computational resources for training and deployment. The authors therefore look for a way to let lite models approach the performance of these resource-intensive models.
2. **Importance of High-Quality Data**: The paper proposes a data generation pipeline that uses strong proprietary models to produce (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answer pairs for visual instruction fine-tuning, yielding a synthetic dataset of 1.3 million samples. Lite models trained on this data achieve competitive performance on 17 benchmarks among 4B-scale LVLMs and perform on par with 7B/13B-scale models on various benchmarks.

In summary, this research aims to demonstrate that high-quality training data can substantially improve lite vision-language models, enabling them to achieve good results even with limited resources.
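The two kinds of synthetic samples described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the record types, the `synthesize` function, the `Annotator` interface, and all prompt strings are hypothetical, and `stub_annotator` stands in for a call to a proprietary model so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CaptionSample:
    """Fine-grained annotation, intended for vision-language alignment."""
    image_id: str
    caption: str


@dataclass
class VQASample:
    """Reasoning-style question-answer pair, intended for instruction fine-tuning."""
    image_id: str
    question: str
    answer: str


# Hypothetical interface: (image_id, prompt) -> generated text.
Annotator = Callable[[str, str], str]


def synthesize(
    image_ids: List[str], annotate: Annotator
) -> Tuple[List[CaptionSample], List[VQASample]]:
    """Build one caption record and one VQA record per image."""
    captions: List[CaptionSample] = []
    vqa: List[VQASample] = []
    for image_id in image_ids:
        # The prompts below are placeholders, not the prompts used in the paper.
        caption = annotate(image_id, "Describe the image in fine-grained detail.")
        question = annotate(image_id, "Ask a question that requires reasoning.")
        answer = annotate(image_id, f"Answer with reasoning: {question}")
        captions.append(CaptionSample(image_id, caption))
        vqa.append(VQASample(image_id, question, answer))
    return captions, vqa


def stub_annotator(image_id: str, prompt: str) -> str:
    """Offline stand-in for a proprietary model call (assumption)."""
    return f"[{image_id}] response to: {prompt}"


if __name__ == "__main__":
    caps, qa = synthesize(["img0", "img1"], stub_annotator)
    print(len(caps), len(qa))
```

Keeping the caption records and the question-answer records in separate lists mirrors the split between the alignment stage and the instruction fine-tuning stage mentioned in the abstract; how the actual dataset is organized may differ.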

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

A-VL: Adaptive Attention for Large Vision-Language Models

Rethinking Overlooked Aspects in Vision-Language Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Improved Baselines with Visual Instruction Tuning

VILA$^2$: VILA Augmented VILA

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Video Instruction Tuning With Synthetic Data

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Small Language Model Meets with Reinforced Vision Vocabulary