GOAL: Grounded Text-to-image Synthesis with Joint Layout Alignment Tuning
Yaqi Li,Han Fang,Zerun Feng,Kaijing Ma,Chao Ban,Xianghao Zang,Lanxiang Zhou,Zhongjiang He,Jingyan Chen,Jiani Hu,Hao Sun,Huayu Zhang
DOI: https://doi.org/10.1145/3664647.3681284
2024-01-01
Abstract:Recent text-to-image (T2I) synthesis models have demonstrated intriguing abilities to produce high-quality images based on text prompts. However, current models still face Text-Image Misalignment problem (e.g., attribute errors and relation mistakes) for compositional generation. Existing models attempted to condition T2I models on grounding inputs to improve controllability while ignoring the explicit supervision from the layout conditions. To tackle this issue, we propose Grounded jOint lAyout aLignment (GOAL), an effective framework for T2I synthesis. Two novel modules, discriminative semantic alignment (DSAlign) and masked attention alignment (MAAlign), are proposed and incorporated in this framework to improve the text-image alignment. DSAlign leverages discriminative tasks at the region-wise level to ensure low-level semantic alignment. MAAlign provides high-level attention alignment by guiding the model to focus on the target object. We also build a dataset GOAL2K for model fine-tuning, which composes 2000 semantically accurate image-text pairs and their layout annotations. Comprehensive evaluations on T2I-Compbench, NSR-1K, and Drawbench demonstrate the superior generation performance of our method. Especially, there are improvements of 19%, 13%, and 12% in color, shape, and texture metrics for T2I-Compbench. Additionally, Q-Align metrics demonstrate that our method can generate images of higher quality.