Improving Long-Text Alignment for Text-to-Image Diffusion Models

Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
2024-10-16
Abstract: The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated images with long texts becomes challenging. To tackle these issues, we propose LongAlign, which includes a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For segment-level encoding, long texts are divided into multiple segments and processed separately. This method overcomes the maximum input length limits of pretrained encoding models. For preference optimization, we provide decomposed CLIP-based preference models to fine-tune diffusion models. Specifically, to utilize CLIP-based preference models for T2I alignment, we delve into their scoring mechanisms and find that the preference scores can be decomposed into two components: a text-relevant part that measures T2I alignment and a text-irrelevant part that assesses other visual aspects of human preference. Additionally, we find that the text-irrelevant part contributes to a common overfitting problem during fine-tuning. To address this, we propose a reweighting strategy that assigns different weights to these two components, thereby reducing overfitting and enhancing alignment. After fine-tuning $512 \times 512$ Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-$\alpha$ and Kandinsky v2.2. The code is available at https://github.com/luping-liu/LongAlign.
Computer Vision and Pattern Recognition, Machine Learning, Multimedia
What problem does this paper attempt to address?
This paper addresses two questions: how to encode long texts for image generation, and how to ensure that the generated images align closely with long text descriptions. Specifically:

1. **Challenges in long-text encoding**: Existing text-to-image (T2I) diffusion models face limitations on long inputs, especially when the text exceeds the maximum input length of pretrained encoders such as CLIP. As a result, the generated images cannot accurately reflect all the details in the text.
2. **Alignment problems**: Even when long inputs can be encoded, existing models have difficulty keeping the generated images fully aligned with the description, especially when it is complex and spans multiple sentences. The generated images often reflect only part of the details in the text.

To address these challenges, the authors propose **LongAlign**, which consists of two main components:

- **Segment-level encoding**: The long text is divided into multiple short segments, which are encoded separately and then merged. This overcomes the maximum input length of pretrained encoders and lets the model handle longer text inputs.
- **Decomposed preference optimization**: By analyzing the scoring mechanism of CLIP-based preference models, the authors decompose the preference score into a text-relevant part, which measures T2I alignment, and a text-irrelevant part, which assesses other visual aspects of human preference. They find that the text-irrelevant part contributes to overfitting during fine-tuning, and therefore propose a reweighting strategy that assigns different weights to the two parts, reducing overfitting and enhancing alignment.

With these two components, LongAlign improves the alignment between generated images and text for long-text inputs.
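
The idea of splitting a long text, encoding the pieces separately, and merging the results can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: `toy_encoder` is a hypothetical stand-in for a length-limited pretrained encoder such as CLIP, the sentence-based splitting rule is one possible choice, and merging is assumed to be a simple concatenation of per-token embeddings.

```python
import re
from typing import Callable, List

Vector = List[float]


def split_into_segments(text: str) -> List[str]:
    """Split a long prompt into sentence-like segments.

    Splitting on sentence-ending punctuation is an assumed rule, chosen
    only to keep the example simple.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_encoder(segment: str, max_tokens: int = 8, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for a length-limited text encoder.

    Returns one embedding per whitespace token and silently drops every
    token past max_tokens, mimicking a fixed maximum input length.
    """
    tokens = segment.split()[:max_tokens]
    return [[float((len(tok) + i) % 7) for i in range(dim)] for tok in tokens]


def encode_long_text(
    text: str, encoder: Callable[[str], List[Vector]] = toy_encoder
) -> List[Vector]:
    """Encode each segment separately, then concatenate the token embeddings."""
    embeddings: List[Vector] = []
    for segment in split_into_segments(text):
        embeddings.extend(encoder(segment))
    return embeddings


prompt = (
    "A red fox sits on a mossy rock. "
    "Behind it, a waterfall drops into a pool. "
    "Fireflies glow under a violet evening sky."
)
truncated = toy_encoder(prompt)        # whole prompt at once: cut at max_tokens
segmented = encode_long_text(prompt)   # per-segment encoding keeps every token
```

In this toy setting, encoding the whole prompt at once keeps only the first 8 tokens, while the segment-wise path keeps all 23, because no individual segment exceeds the encoder's limit.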
Experimental results show that after about 20 hours of fine-tuning with LongAlign, $512 \times 512$ Stable Diffusion v1.5 outperforms stronger foundation models, such as PixArt-α and Kandinsky v2.2, in long-text alignment.
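
The decomposition and reweighting of the preference score can also be sketched in a few lines. This is a hedged illustration, not the paper's procedure: the text here does not specify how the two parts are computed, so the sketch assumes the score is a dot product between a text embedding and an image embedding, and that the text-irrelevant part is the projection of the text embedding onto a direction shared by all texts (estimated as the normalized mean text embedding). The function names and the `irrelevant_weight` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def common_direction(text_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Unit vector along the mean text embedding (assumed shared direction)."""
    dim = len(text_embeddings[0])
    count = len(text_embeddings)
    mean = [sum(e[i] for e in text_embeddings) / count for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    if norm == 0.0:
        raise ValueError("mean text embedding is zero; no shared direction")
    return [x / norm for x in mean]


def decompose_score(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    common: Sequence[float],
) -> Tuple[float, float]:
    """Split <text, image> into (text-relevant, text-irrelevant) parts."""
    coeff = dot(text_emb, common)
    irrelevant = [coeff * c for c in common]                # shared component
    relevant = [t - s for t, s in zip(text_emb, irrelevant)]  # remainder
    return dot(relevant, image_emb), dot(irrelevant, image_emb)


def reweighted_score(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    common: Sequence[float],
    irrelevant_weight: float,
) -> float:
    """Keep the text-relevant part and scale the text-irrelevant part."""
    relevant, irrelevant = decompose_score(text_emb, image_emb, common)
    return relevant + irrelevant_weight * irrelevant
```

Under these assumptions the two parts always sum to the original dot-product score, so choosing `irrelevant_weight` below 1 only reduces the influence of the shared component while leaving the text-specific signal unchanged.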