Abstract: Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential for synthetic data in text-to-image training.

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the alignment between generated images and their text descriptions in text-to-image (T2I) models by improving the quality of the image captions used for training. Specifically, the authors study how caption precision and recall affect T2I training.

### Background and Problem

Although diffusion models (such as Stable Diffusion, DALL·E 3, Emu, and Imagen) have made significant progress in image synthesis in recent years, they still struggle to generate images that accurately reflect text descriptions. One key cause is misalignment between captions and images in the training data: a caption may describe only part of the image content (low recall) or describe the content inaccurately (low precision).

### Research Objectives

To address this problem, the authors analyze the importance of caption precision and recall in T2I training. The specific research objectives are:

1. **Evaluating Caption Quality**: Systematically evaluate caption precision and recall to determine their impact on the performance of the T2I model.
2. **Generating Synthetic Captions**: Use Large Vision Language Models (LVLMs) to generate synthetic captions and evaluate how well these captions work for T2I training.
3. **Verifying Conclusions**: Check whether training on synthetic captions gives results consistent with training on human-annotated captions, thereby verifying the importance of precision and recall.

### Main Contributions

- **Systematic Evaluation**: The authors systematically evaluate the impact of precision and recall on T2I training and find that, although both are important, precision has a more significant impact on model performance.
- **Synthetic Caption Experiment**: Experiments use multiple LVLMs to generate synthetic captions. T2I models trained with these synthetic captions behave consistently with models trained on human-annotated captions, further supporting the importance of precision.

### Formula Representation

The paper gives the loss function of the diffusion model as:

\[ L := \mathbb{E}_{\mathcal{E}(x), c, \epsilon \sim \mathcal{N}(0, 1)} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(c)) \right\|^2_2 \right] \]

where $\epsilon$ is the noise, $c$ is the caption, $z_t$ is the noised latent at denoising time step $t$, $\theta$ denotes the parameters of the diffusion model, and $\mathcal{E}$ and $\tau_\theta$ are the image and text encoders, respectively.

Through these studies, the authors hope to provide more effective caption-generation strategies for future T2I training, especially for the use of synthetic captions.
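The diffusion loss given above can be approximated numerically. The following is a minimal sketch, not the paper's training code: `predict_noise` is a hypothetical stand-in for $\epsilon_\theta(z_t, t, \tau_\theta(c))$, latents are plain lists of floats standing in for encoded images, and both the linear noise schedule and the uniform sampling of $t$ are assumptions made only so the example runs.

```python
import random


def noise_prediction_loss(true_noise, predicted_noise):
    """Squared L2 distance between the sampled noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(true_noise, predicted_noise))


def monte_carlo_loss(predict_noise, latents, captions, num_steps=1000, seed=0):
    """Estimate the loss by averaging over (latent, caption) training pairs.

    predict_noise(z_t, t, caption) is a hypothetical model interface.
    """
    rng = random.Random(seed)
    total = 0.0
    for z0, caption in zip(latents, captions):
        # Assumption: the time step is drawn uniformly from 1..num_steps.
        t = rng.randrange(1, num_steps + 1)
        # Sample Gaussian noise with the same dimension as the latent.
        noise = [rng.gauss(0.0, 1.0) for _ in z0]
        # Assumption: toy linear schedule that mixes the latent with noise.
        keep = 1.0 - t / num_steps
        z_t = [(keep ** 0.5) * z + ((1.0 - keep) ** 0.5) * e
               for z, e in zip(z0, noise)]
        total += noise_prediction_loss(noise, predict_noise(z_t, t, caption))
    return total / len(latents)


def zero_predictor(z_t, t, caption):
    """Untrained baseline that always predicts zero noise."""
    return [0.0 for _ in z_t]


latents = [[0.1, -0.3, 0.7], [0.5, 0.2, -0.4]]
captions = ["a dog catching a frisbee", "a red car on a street"]
print(monte_carlo_loss(zero_predictor, latents, captions))
```

In this sketch the caption enters the loss only through `predict_noise`, which mirrors why a caption that misdescribes its image gives the model a misleading conditioning signal during training.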
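To make caption precision and recall concrete, the sketch below treats an image and a caption as sets of described elements. This set-based formulation, the function name `caption_precision_recall`, and the example elements are illustrative assumptions rather than the paper's evaluation protocol: precision is taken as the fraction of caption elements that are actually in the image, and recall as the fraction of image elements that the caption mentions.

```python
def caption_precision_recall(caption_elements, image_elements):
    """Return (precision, recall) of a caption against the image content.

    Both arguments are collections of element labels (hypothetical format).
    """
    caption_elements = set(caption_elements)
    image_elements = set(image_elements)
    correct = caption_elements & image_elements
    # Precision: share of what the caption says that is true of the image.
    precision = len(correct) / len(caption_elements) if caption_elements else 0.0
    # Recall: share of the image content that the caption covers.
    recall = len(correct) / len(image_elements) if image_elements else 0.0
    return precision, recall


image = {"dog", "frisbee", "grass", "tree"}
# Accurate but incomplete caption: high precision, low recall.
partial_caption = {"dog", "frisbee"}
# Complete caption with two wrong elements: high recall, lower precision.
noisy_caption = {"dog", "frisbee", "grass", "tree", "cat", "ball"}

print(caption_precision_recall(partial_caption, image))  # (1.0, 0.5)
print(caption_precision_recall(noisy_caption, image))
```

Under these assumptions, the first caption corresponds to the "describes only part of the image" failure mode and the second to the "describes the content inaccurately" failure mode discussed in the summary.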

Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
