ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang
2024-06-17
Abstract: Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset, and experimental results demonstrate the effectiveness of the proposed scheme: the models achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA and open-source it to the research community for developing better resource-efficient LVLMs for wider usage.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to close the performance gap between lite vision-language models and traditional-scale large vision-language models (LVLMs). Specifically:
1. **Resource Consumption Issue**: LVLMs exhibit strong reasoning and generalization capabilities across a broad range of vision-language tasks, but they require considerable computational resources for training and deployment. The paper therefore looks for a way to bring resource-friendly lite models closer to the performance of these resource-intensive models.
2. **Importance of High-Quality Data**: The paper proposes a data generation pipeline that uses strong proprietary models to produce (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding a synthetic dataset of 1.3 million samples. Lite models trained on this data achieve competitive performance on 17 benchmarks among 4B-scale models and perform on par with 7B/13B-scale models on various benchmarks.
In summary, the paper argues that high-quality training data can substantially improve lite vision-language models, making competitive performance feasible under limited computational resources.