Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino
2024-06-07
Abstract: The creation of high-quality human-labeled image-caption datasets presents a significant bottleneck in the development of Visual-Language Models (VLMs). In this work, we investigate an approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM. Despite the text-to-image model and VLM initially being trained on the same data, our approach leverages the image generator's ability to create novel compositions, resulting in synthetic image embeddings that expand beyond the limitations of the original dataset. Extensive experiments demonstrate that our VLM, finetuned on synthetic data, achieves comparable performance to models trained solely on human-annotated data, while requiring significantly less data. Furthermore, we perform a set of analyses on captions which reveals that semantic diversity and balance are key aspects for better downstream performance. Finally, we show that synthesizing images in the image embedding space is 25\% faster than in the pixel space. We believe our work not only addresses a significant challenge in VLM training but also opens up promising avenues for the development of self-improving multi-modal models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is the shortage of high-quality labeled data for training vision-language models (VLMs). Specifically, the paper identifies the creation of high-quality human-labeled image-text datasets as a significant bottleneck. This bottleneck limits the development and performance of VLMs for several reasons:

1. **Data Scarcity**: High-quality paired data is scarce, especially for complex multi-object scenes and detailed descriptions.
2. **Data Noise**: Data obtained from sources such as the Internet may be noisy and requires extensive cleaning.
3. **High Labeling Cost**: Manually labeling a large number of image-text pairs is expensive and time-consuming.
4. **Low Semantic Diversity and Balance**: Existing datasets have limited semantic diversity and balance, which hurts model performance on certain tasks.

To address these problems, the paper proposes using pretrained large language models (LLMs) and image generation models to create synthetic image-text pairs. The method has the following components:

- **Synthetic Text Generation**: Use an LLM to generate high-quality synthetic captions.
- **Synthetic Image Generation**: Use a pretrained text-to-image model to synthesize image embeddings from the generated captions.
- **Efficient Embedding Space Generation**: Synthesize images in the image embedding space instead of the pixel space, which the paper reports is 25% faster.

With this approach, the paper shows that a VLM finetuned on synthetic data achieves performance comparable to that of models trained only on human-annotated data on multiple downstream tasks, while requiring significantly less data.
This eases the problems of data scarcity and high labeling cost.
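The data flow described above (captions from an LLM, then image embeddings from a text-to-image model, then image-text pairs for VLM training) can be sketched in a few lines of Python. This is only an illustrative toy, not the paper's implementation: `generate_captions`, `caption_to_image_embedding`, `build_synthetic_pairs`, the caption templates, and `EMBED_DIM` are all hypothetical stand-ins, and the "embedding" is a hash of the caption rather than the output of a real generator.

```python
import hashlib
import random
from typing import List, Tuple

EMBED_DIM = 8  # hypothetical, tiny embedding size chosen only for illustration


def generate_captions(concepts: List[str], per_concept: int, seed: int = 0) -> List[str]:
    """Stand-in for the LLM: emits the same number of captions for every
    concept, so the caption set is balanced across concepts."""
    rng = random.Random(seed)
    templates = [
        "a photo of a {} on a table",
        "a {} next to a window",
        "a close-up of a {} in a park",
    ]
    return [rng.choice(templates).format(c) for c in concepts for _ in range(per_concept)]


def caption_to_image_embedding(caption: str) -> List[float]:
    """Stand-in for the text-to-image model: maps a caption directly to a
    fixed-size vector and never decodes to pixels. A real generator would
    be a trained network; here the vector is derived from a hash."""
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:EMBED_DIM]]


def build_synthetic_pairs(concepts: List[str], per_concept: int) -> List[Tuple[List[float], str]]:
    """Pair each synthetic caption with its synthetic image embedding."""
    captions = generate_captions(concepts, per_concept)
    return [(caption_to_image_embedding(c), c) for c in captions]


# Toy usage: three concepts, two captions each.
pairs = build_synthetic_pairs(["dog", "bicycle", "teapot"], per_concept=2)
for embedding, caption in pairs:
    print(caption, [round(v, 2) for v in embedding])
```

In a real system the resulting (embedding, caption) pairs would be fed to the VLM in place of encoded real images; the sketch only shows how the pieces connect and why skipping the pixel decoding step removes work from the loop.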