Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveal that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.

What problem does this paper attempt to address?

The main problem this paper addresses is the shortage of high-quality labeled data for training vision-language models (VLMs). Specifically, the paper identifies the creation of high-quality human-labeled image-text pair datasets as a significant bottleneck. This restricts the development and performance of VLMs for several reasons:

1. **Data Scarcity**: High-quality paired data is scarce, especially for complex multi-object scenes and detailed descriptions.
2. **Data Noise**: Data collected from sources such as the Internet may be noisy and require extensive cleaning.
3. **High Labeling Cost**: Manually labeling a large number of image-text pairs is expensive and time-consuming.
4. **Low Semantic Diversity and Balance**: Existing datasets have poor semantic diversity and balance, which hurts model performance on certain tasks.

To address these problems, the paper proposes using pretrained large language models (LLMs) and image generation models to create synthetic image-text pairs. The method has the following components:

- **Synthetic Text Generation**: Use an LLM to generate high-quality synthetic captions.
- **Synthetic Image Generation**: Use a pretrained text-to-image model to synthesize image embeddings from the generated captions.
- **Efficient Embedding Space Generation**: Generate images in the image embedding space instead of the pixel space, which improves efficiency and reduces resource consumption.

With this approach, the paper shows that a VLM finetuned on synthetic data achieves performance comparable to that of models trained only on human-labeled data across multiple downstream tasks, while requiring significantly less data. This mitigates the problems of data scarcity and high labeling cost.
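The data flow described above (captions from a language model, then image embeddings generated directly from those captions, with no pixel rendering) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: `generate_captions` is a template-based stand-in for the LLM, `caption_to_embedding` is a hash-seeded stand-in for the text-to-image model, and `EMBED_DIM`, `TEMPLATES`, and `SyntheticPair` are hypothetical names.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import List

EMBED_DIM = 8  # hypothetical embedding size, chosen only for this demo

# Hypothetical caption templates; a real pipeline would prompt an LLM instead.
TEMPLATES = [
    "a photo of a {} on a table",
    "a {} next to a window",
    "two people looking at a {}",
]


@dataclass
class SyntheticPair:
    caption: str
    image_embedding: List[float]


def generate_captions(class_names: List[str], per_class: int, seed: int = 0) -> List[str]:
    """Stand-in for the LLM step: emit `per_class` captions for each class name.

    Drawing the same number of captions per class keeps the set balanced.
    """
    rng = random.Random(seed)
    return [
        rng.choice(TEMPLATES).format(name)
        for name in class_names
        for _ in range(per_class)
    ]


def caption_to_embedding(caption: str, dim: int = EMBED_DIM) -> List[float]:
    """Stand-in for the text-to-image step: map a caption directly to an
    embedding vector. No pixels are rendered. The values are pseudo-random,
    seeded by a hash of the caption, so the mapping is deterministic.
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def build_synthetic_dataset(class_names: List[str], per_class: int) -> List[SyntheticPair]:
    """Chain the two stand-in steps into (caption, image embedding) pairs."""
    captions = generate_captions(class_names, per_class)
    return [SyntheticPair(c, caption_to_embedding(c)) for c in captions]


pairs = build_synthetic_dataset(["dog", "bicycle", "teapot"], per_class=2)
for pair in pairs:
    print(pair.caption, [round(x, 2) for x in pair.image_embedding[:3]])
```

In a real setup the resulting pairs would be fed to the VLM in place of (or alongside) human-annotated pairs; the sketch only shows the shape of the data, not any training step.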

Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

Harnessing the Power of Large Vision Language Models for Synthetic Image Detection

Distilling Vision-Language Models on Millions of Videos

CompCap: Improving Multimodal Large Language Models with Composite Captions

Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

VILA$^2$: VILA Augmented VILA

Improving Visual Commonsense in Language Models via Multiple Image Generation

Improving Text Generation on Images with Synthetic Captions

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Generating Images with Multimodal Language Models

Image Captioning with Multi-Context Synthetic Data

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

VLIS: Unimodal Language Models Guide Multimodal Language Generation

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data