Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai
2024-11-15
Abstract: Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To address these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and the overall detail refinement over regions (Regional Soft Refinement), which dismisses visual boundaries and enhances adjacent interactions. Furthermore, RAG makes repainting feasible: users can modify specific unsatisfactory regions of the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to the prompt-following property. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance in attribute binding and object relationships compared with previous tuning-free methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to achieve fine-grained spatial control in text-to-image generation. Although existing generation models have made remarkable progress in producing high-quality images from text, they still struggle to respect the number of objects and their spatial arrangement. To address these limitations, the authors propose RAG (Region-Aware text-to-image Generation), a method conditioned on region descriptions that gives more precise control over layout composition. Specifically, RAG handles multi-region generation through two subtasks:

1. **Regional Hard Binding**: Ensures that each region prompt is accurately executed. In the early stage of the denoising process, the input prompt is decomposed into multiple region prompts, and the latent representation of each region is handled separately.
2. **Regional Soft Refinement**: In subsequent steps, enhances the interaction between adjacent regions, eliminates visual boundaries, and improves the harmony of overall details.

In addition, RAG supports image repainting: users can modify specific unsatisfactory regions while keeping the other regions unchanged, without relying on additional inpainting models. Together, these properties let RAG handle complex multi-region prompts and outperform previous tuning-free methods. Through quantitative and qualitative experiments, the paper demonstrates the superior performance of RAG in attribute binding, object relationships, and complex composition generation.
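To make the two-stage idea concrete, here is a toy NumPy sketch of how per-region latents might be combined at different denoising steps. It is not the paper's implementation: the function names, the `hard_steps` threshold, the blending `weight`, and the use of plain arrays in place of diffusion latents are all assumptions made for illustration, and the summary does not specify where in the model the real refinement is applied.

```python
import numpy as np

def hard_bind(base_latent, region_latents, masks):
    """Overwrite each masked region of the base latent with that region's own latent."""
    out = base_latent.copy()
    for latent, mask in zip(region_latents, masks):
        out = np.where(mask, latent, out)
    return out

def soft_refine(base_latent, region_latents, masks, weight):
    """Blend each region's latent into the base latent instead of overwriting it."""
    out = base_latent.copy()
    for latent, mask in zip(region_latents, masks):
        blended = weight * latent + (1.0 - weight) * out
        out = np.where(mask, blended, out)
    return out

def combine_regions(step, hard_steps, base_latent, region_latents, masks, weight=0.5):
    """Hypothetical schedule: hard binding for early steps, soft blending afterwards."""
    if step < hard_steps:
        return hard_bind(base_latent, region_latents, masks)
    return soft_refine(base_latent, region_latents, masks, weight)

# Toy example: a 2x4 "latent" split into a left and a right region.
base = np.zeros((2, 4))
left = np.zeros((2, 4), dtype=bool)
left[:, :2] = True
right = ~left
regions = [np.full((2, 4), 1.0), np.full((2, 4), 2.0)]

early = combine_regions(0, 3, base, regions, [left, right])
late = combine_regions(5, 3, base, regions, [left, right], weight=0.5)
print(early[0].tolist())  # [1.0, 1.0, 2.0, 2.0]
print(late[0].tolist())   # [0.5, 0.5, 1.0, 1.0]
```

In the early step each region is fully determined by its own latent, which mirrors the strict per-region handling described for hard binding. In the later step the regional content is only mixed into the shared latent, a simplified stand-in for the softer cross-region interaction that refinement is meant to provide.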