CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li

2024-06-03

Abstract: Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root cause of the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to address the alignment issue in text-to-image generation. Specifically, existing diffusion models often fail to stay consistent with text prompts when generating images, especially when dealing with complex prompts. The main issues can be divided into two aspects:

1. **Concept Ignorance**: Diffusion models sometimes ignore certain concepts in the text prompts, resulting in these concepts being absent from the generated images.
2. **Concept Mismapping**: Even if a concept is activated, diffusion models may fail to correctly map it to the corresponding area in the image.

To solve these problems, the authors propose **CoMat**, an end-to-end fine-tuning strategy that improves the diffusion model's understanding of and adherence to text conditions by introducing an image-to-text concept matching mechanism. The specific methods include:

- **Concept Activation Module**: Detects concepts missing from the generated images using a pre-trained image-to-text model and guides the diffusion model to refocus on these ignored concepts.
- **Attribute Concentration Module**: Ensures that each entity attribute in the text prompt is correctly mapped to the corresponding area in the image.

Through these methods, CoMat significantly improves text-to-image alignment on two benchmarks, especially in terms of object presence, attribute binding, relationships, and complex prompts.
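The two modules can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names `caption_nll` and `attention_concentration_loss`, the toy probabilities, and the exact loss forms are assumptions chosen only to show the idea. The first function scores how well a captioner "reads back" the prompt from a generated image (a low token probability signals an ignored concept); the second measures how much of a token's attention mass falls outside the image region of its entity.

```python
import numpy as np


def caption_nll(token_log_probs):
    """Negative log-likelihood of the prompt tokens under a captioner.

    `token_log_probs` is assumed to hold log p(token | image, previous tokens)
    for each prompt token, as produced by some image-to-text model.
    A higher value means the image matches the prompt less well.
    """
    return -float(np.sum(token_log_probs))


def attention_concentration_loss(attn_map, region_mask, eps=1e-8):
    """One minus the fraction of attention mass inside the entity's region.

    `attn_map` is a non-negative attention map for one token and
    `region_mask` is a binary mask of the entity's area (both hypothetical
    inputs here). The loss is near 0 when attention stays inside the mask.
    """
    attn = np.asarray(attn_map, dtype=float)
    mask = np.asarray(region_mask, dtype=float)
    inside = float((attn * mask).sum())
    total = float(attn.sum())
    return 1.0 - inside / (total + eps)


# Toy captioner outputs: in the second case the middle token is barely
# recognized in the image, so its probability is low.
log_probs_present = np.log([0.9, 0.8, 0.9])
log_probs_ignored = np.log([0.9, 0.05, 0.9])

# Toy 2x2 attention map with half of its mass outside the entity region.
attn = [[0.5, 0.5], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]
```

In an actual fine-tuning setup, scores of this kind would be computed on differentiable model outputs and backpropagated into the diffusion model; the NumPy version above only shows which quantity is being penalized.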

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Towards Better Text-to-Image Generation Alignment via Attention Modulation

Improving Long-Text Alignment for Text-to-Image Diffusion Models

Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Text-image Alignment for Diffusion-based Perception

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

ECNet: Effective Controllable Text-to-Image Diffusion Models

Exploring Discrete Diffusion Models for Image Captioning

Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance

An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models