CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
2024-06-03
Abstract: Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL in two text-to-image alignment benchmarks and achieves state-of-the-art performance.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper aims to address the alignment issue in text-to-image generation. Specifically, existing diffusion models often fail to stay consistent with text prompts when generating images, especially when dealing with complex prompts. The main issues can be divided into two aspects:

1. **Concept Ignorance**: Diffusion models sometimes ignore certain concepts in the text prompts, resulting in these concepts being absent in the generated images.
2. **Concept Mismapping**: Even if a concept is activated, diffusion models may fail to correctly map it to the corresponding area in the image.

To solve these problems, the authors propose **CoMat**, an end-to-end fine-tuning strategy that enhances the diffusion model's understanding and adherence to text conditions by introducing a concept matching mechanism from image to text. The specific methods include:

- **Concept Activation Module**: Detects missing concepts in the generated images through a pre-trained image-to-text model and guides the diffusion model to refocus on these ignored concepts.
- **Attribute Concentration Module**: Ensures that each entity attribute in the text prompt is correctly mapped to the corresponding area in the image.

Through these methods, CoMat significantly improves text-to-image alignment in multiple benchmark tests, especially in terms of object presence, attribute binding, relationships, and complex prompts.
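
The two modules can be pictured as two training signals. The sketch below is a toy illustration in plain Python, not the authors' implementation: the function names, the per-token probabilities, and the tiny attention map and mask are all hypothetical stand-ins. It assumes the captioning model exposes a probability for each prompt token given the generated image, and that the attribute module compares a token's attention map against a region mask for its entity.

```python
import math


def concept_matching_loss(token_probs, eps=1e-12):
    """Negative log-likelihood of the prompt under a captioning model.

    token_probs[i] is a hypothetical probability that the captioner assigns
    to the i-th prompt token given the generated image. A concept missing
    from the image gets a low probability and therefore a large loss.
    """
    return -sum(math.log(max(p, eps)) for p in token_probs)


def attribute_concentration_loss(attention, mask):
    """One minus the share of attention mass that falls inside the mask.

    attention: 2D list of non-negative attention weights for one token.
    mask: 2D list of 0/1 values marking the entity's region (assumed given).
    """
    total = sum(sum(row) for row in attention)
    inside = sum(
        a
        for a_row, m_row in zip(attention, mask)
        for a, m in zip(a_row, m_row)
        if m
    )
    return 1.0 - inside / total


# Prompt tokens: "red", "apple", "table" (illustrative numbers only).
aligned = concept_matching_loss([0.9, 0.8, 0.85])
ignored = concept_matching_loss([0.9, 0.8, 0.05])  # "table" is absent

# Attention for "red" split evenly between the apple region and elsewhere.
attn = [[0.5, 0.0], [0.0, 0.5]]
apple_mask = [[1, 0], [0, 0]]
spill = attribute_concentration_loss(attn, apple_mask)
```

Here `ignored` is larger than `aligned`, so minimising the first loss pushes the generator to render the missing concept, while `spill` shrinks toward zero as the attribute's attention concentrates inside its entity's region.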