Abstract: Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, because the feedback content lacks focus, especially regarding object type and quantity, these techniques struggle to match text and images accurately when the prompt specifies particular objects and counts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, consisting of three stages. First, an object detector is run on images generated by the diffusion model to obtain the object categories and quantities. The confidence of category and quantity can then be derived from the detection results and the given prompts. Next, we define a novel matching score, based on these confidences, to measure text-image alignment. It guides the model during feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients so that it generates semantically related images. Unlike previous feedback methods, which focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. In addition, we construct a text-to-image dataset for studying compositional generation, comprising 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. Our model can also serve as a metric for evaluating text-image alignment in other models. All code and data are available at https://github.com/kingniu0329/Visions.

### What problem does this paper attempt to solve?

This paper addresses inaccurate alignment between generated images and their text prompts in text-to-image diffusion models, especially for complex prompts that specify multiple object categories and quantities. Specifically:

1. **Text-image alignment problem**: Current text-to-image generation models often fail to map the object categories and quantities in the text onto the generated image when the prompt describes a combination of objects. For example, for the prompt "1 tiger and 2 lions on a lotus leaf", the model may generate an image whose object categories or counts do not match the prompt.
2. **Limitations of feedback mechanisms**: Existing feedback-based learning methods can improve the overall alignment between text and image, but they pay too little attention to the specific categories and quantities of objects. These methods usually rely on broader feedback such as semantic similarity or image quality and cannot precisely control the object details in the generated image.

### Solution

To solve the above problems, the authors propose an efficient fine-tuning method that improves the alignment of text-to-image diffusion models through specific reward objectives. The method has three main stages:

1. **Object detection and confidence calculation**:
   - Generate an image using a pre-trained diffusion model.
   - Use an object detection model (such as YOLOS) to detect the object categories and quantities in the generated image and calculate the confidence of each category.
   - Compare the detection results with the text prompt to obtain the confidence of the category and quantity.
2. **Define a new matching score**:
   - Introduce a new matching score (CQ Score) that combines the category confidence (Acc) and the quantity confidence (Aqc) and balances their contributions through the harmonic mean.
   - Use the CQ Score as a reward function to guide the model during feedback learning.
3. **Fine-tuning based on the reward function**:
   - Fine-tune the diffusion model by back-propagating the gradient of the reward function so that it generates images that better match the text prompt.
   - Combine the pre-training loss (Lpretrain) and the reward-driven loss (Lreward) so that the model maintains image quality while improving alignment.

In addition, the authors construct a dataset of 1,700 text-image pairs specifically for studying compositional generation. The experimental results show that this method outperforms other state-of-the-art methods in both alignment and image quality.

### Formula representation

- **Category confidence (Acc)**:

  \[ \text{Acc} = \frac{1}{z_{nc}} \sum_{i = 1}^{z_{nc}} \frac{p_i^c}{z_{ni}^b} \cdot I\left(z_i^c \in \{x_j^c\}_{j = 1}^{x_{nc}}\right) \]

  where \(p_i^c\) is the sum of the confidences of all bounding boxes of the \(i\)-th class of objects, \(z_{ni}^b\) is the number of bounding boxes of the \(i\)-th class of objects, and \(I\) is the indicator function.
- **Quantity confidence (Aqc)**:

  \[ \text{Aqc} = \frac{1}{z_{nc}} \sum_{i = 1}^{z_{nc}} \frac{1}{x_{nc}} \sum_{j = 1}^{x_{nc}} \frac{\min(z_{ni}^b, x_{nj}^c)}{\max(z_{ni}^b, x_{nj}^c)} \]
- **CQ Score**:

  \[ \text{CQ Score} = \frac{2 \times \text{Acc} \times \text{Aqc}}{\text{Acc} + \text{Aqc}} \]

According to the paper, these components together improve the alignment accuracy and image quality of text-to-image generation models on complex prompts.
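To make the three formulas concrete, here is a minimal Python sketch that computes Acc, Aqc, and the CQ Score for one image. It is not the authors' implementation. The function name `cq_score`, the input layout (a list of `(class_name, confidence)` pairs from a detector plus a dict of prompt quantities), and the zero score for an empty detection set are all assumptions made for illustration. The sketch also assumes the indicator function acts as a multiplicative factor, so a detected class that the prompt does not name contributes zero to Acc.

```python
from collections import defaultdict


def cq_score(detections, prompt_counts):
    """Return (acc, aqc, cq) for one generated image.

    detections: list of (class_name, confidence) pairs, one per bounding box.
    prompt_counts: dict mapping each class named in the prompt to its quantity.
    Both names and the input layout are hypothetical, chosen for this sketch.
    """
    # Group bounding-box confidences by detected class.
    conf_by_class = defaultdict(list)
    for name, conf in detections:
        conf_by_class[name].append(conf)

    z_nc = len(conf_by_class)  # number of detected classes
    x_nc = len(prompt_counts)  # number of classes named in the prompt
    if z_nc == 0 or x_nc == 0:
        # Assumption: an empty detection set or empty prompt scores zero.
        return 0.0, 0.0, 0.0

    acc = 0.0
    aqc = 0.0
    for name, confs in conf_by_class.items():
        z_b = len(confs)  # number of boxes for this detected class
        # Mean box confidence, counted only if the prompt names this class.
        if name in prompt_counts:
            acc += sum(confs) / z_b
        # Min/max count ratio, averaged over every prompt class.
        ratios = (min(z_b, x) / max(z_b, x) for x in prompt_counts.values())
        aqc += sum(ratios) / x_nc
    acc /= z_nc
    aqc /= z_nc

    # Harmonic mean of the two confidences; zero if both are zero.
    cq = 2 * acc * aqc / (acc + aqc) if acc + aqc > 0 else 0.0
    return acc, aqc, cq


# Prompt "1 tiger and 2 lions": one tiger box and two lion boxes detected.
acc, aqc, cq = cq_score(
    [("tiger", 0.9), ("lion", 0.8), ("lion", 0.6)],
    {"tiger": 1, "lion": 2},
)
print(f"Acc={acc:.3f} Aqc={aqc:.3f} CQ={cq:.3f}")
```

In this example the detected categories match the prompt, so Acc is the average of the per-class mean confidences (0.9 and 0.7, giving 0.8). Aqc follows the formula as written and averages the count ratio over every pair of detected class and prompt class, which yields 0.75 here even though both counts are correct. The harmonic mean of the two is about 0.774.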

Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback
