Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai

2024-11-15

Abstract: Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and the overall detail refinement over regions (Regional Soft Refinement), which removes visual boundaries and enhances interactions between adjacent regions. Furthermore, RAG makes repainting feasible: users can modify specific unsatisfactory regions in the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to their prompt-following ability. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance in attribute binding and object relationships compared with previous tuning-free methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper aims to achieve fine-grained spatial control in text-to-image generation. Although existing generative models have made remarkable progress in producing high-quality images from text, they still struggle with the number of objects in a prompt and their spatial arrangement. To address these limitations, the authors propose RAG (Region-Aware text-to-image Generation), a method conditioned on regional descriptions that aims to give more refined control over precise layout composition. Specifically, RAG splits multi-region generation into two subtasks:

1. **Regional Hard Binding**: Ensures that each regional prompt is accurately executed. In the early stage of the denoising process, the input prompt is decomposed into multiple regional prompts, and the latent representation of each region is handled separately.
2. **Regional Soft Refinement**: In the subsequent steps, enhances the interaction between adjacent regions, eliminates visible boundaries, and improves the overall harmony of details.

RAG also supports repainting. Users can modify specific unsatisfactory regions while keeping all other regions unchanged, without relying on additional inpainting models. According to the paper, RAG handles complex multi-region prompts well and outperforms previous tuning-free methods. Quantitative and qualitative experiments are used to demonstrate its performance in attribute binding, object relationships, and complex composition generation.
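The two-stage schedule described above can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration: `toy_denoise` is a hypothetical stand-in for a real diffusion denoising step, regions are axis-aligned boxes, prompts are reduced to a single number, and the weighted blend in `soft_refine` is one simple way to let regions interact. It is not the paper's formulation or code.

```python
import numpy as np

def toy_denoise(latent, prompt_value, strength=0.1):
    """Hypothetical stand-in for one prompt-conditioned denoising step.

    It nudges the latent toward a constant that represents the prompt.
    """
    return latent + strength * (prompt_value - latent)

def hard_bind(latent, regions, base_prompt=0.0):
    """Early steps: denoise each region on its own and paste it back."""
    out = toy_denoise(latent, base_prompt)  # base prompt covers the canvas
    for (y0, y1, x0, x1), prompt_value in regions:
        crop = latent[y0:y1, x0:x1]
        out[y0:y1, x0:x1] = toy_denoise(crop, prompt_value)
    return out

def soft_refine(latent, regions, base_prompt=0.0, weight=0.5):
    """Later steps: blend a global update with the regional updates.

    The linear blend is an assumption chosen for illustration only.
    """
    base = toy_denoise(latent, base_prompt)
    regional = base.copy()
    for (y0, y1, x0, x1), prompt_value in regions:
        regional[y0:y1, x0:x1] = toy_denoise(latent[y0:y1, x0:x1], prompt_value)
    return (1.0 - weight) * base + weight * regional

def generate(shape, regions, steps=10, hard_steps=3, seed=0):
    """Run hard binding for the first `hard_steps`, then soft refinement."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(shape)
    for t in range(steps):
        if t < hard_steps:
            latent = hard_bind(latent, regions)
        else:
            latent = soft_refine(latent, regions)
    return latent

# Two side-by-side regions with different (toy) prompts.
regions = [((0, 4, 0, 4), 1.0), ((0, 4, 4, 8), -1.0)]
result = generate((4, 8), regions)
```

In this sketch `hard_steps` controls how long regions are treated independently; a larger value binds each region more strongly to its own prompt, while a smaller value leaves more steps for the blended update.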

Region-Aware Image Captioning Via Interaction Learning

Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

Text-Driven Image Editing via Learnable Regions

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Expressive Text-to-Image Generation with Rich Text

DreamBooth++: Boosting Subject-Driven Generation Via Region-Level References Packing

ReCo: Region-Controlled Text-to-Image Generation

Realistic Image Generation using Region-phrase Attention

Where You Edit is What You Get: Text-guided Image Editing with Region-Based Attention

Region-Aware Portrait Retouching with Sparse Interactive Guidance

Multi-Region Text-Driven Manipulation of Diffusion Imagery

Local Conditional Controlling for Text-to-Image Diffusion Models

Region-Aware Diffusion for Zero-shot Text-driven Image Editing

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

Dynamic Prompt Optimizing for Text-to-Image Generation

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

DT2I: Dense Text-to-Image Generation from Region Descriptions