Abstract: Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermediate representation to bridge large language models and layout-based diffusion models. However, these methods still struggle to generate images from textual prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach that decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical and spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks, and our approach outperforms previous state-of-the-art models by notable margins. In addition, visual results demonstrate that our approach significantly improves controllability and consistency in generating multiple objects from complex textual prompts.

What problem does this paper attempt to address?

The paper addresses a limitation of text-to-image (T2I) generation: when a text prompt contains multiple objects and complex spatial relationships, existing methods struggle to generate high-quality images. Specifically, existing T2I models perform poorly on prompts that specify object counts, different sizes, rich details, and complex spatial relationships. These problems show up in three ways:

1. **Insufficient numerical and spatial reasoning ability**: Existing T2I models have difficulty performing numerical and spatial reasoning from complex text prompts, especially prompts with multiple objects and complex spatial relationships.
2. **Inconsistent generation quality**: Objects differ in how difficult they are to generate, so existing layout-to-image models handle objects with different characteristics unevenly. As a result, some high-difficulty objects are poorly synthesized in the image.
3. **High computational cost**: Some methods generate each object separately through multiple forward passes, so the computational cost rises significantly as the number of objects increases.

To address these challenges, the paper proposes DivCon, a method based on a divide-and-conquer strategy. DivCon improves the quality and accuracy of the generated image by decomposing the complex generation task into multiple subtasks. Its main contributions are:

- **Divide-and-conquer strategy**: Decompose the complex layout prediction and image generation tasks into multiple simple subtasks, thereby improving generation quality and accuracy.
- **Layout prediction stage**: Divide layout prediction into two steps, numerical and spatial reasoning followed by bounding box prediction, to parse the component information in the text prompt more accurately.
- **Layout-to-image generation stage**: Divide layout-to-image generation into two steps and generate objects progressively, from easier ones to more difficult ones, to achieve higher-quality images.

With these improvements, DivCon's experimental results on the HRS and NSR-1K benchmarks show that it outperforms existing T2I models in numerical and spatial reasoning and in the quality of the generated images.
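
The control flow of the decomposition described above can be illustrated with a small Python sketch. This is not the paper's implementation: the function names, the equal-width placement rule, the `score` callable, and the `threshold` value are all hypothetical stand-ins chosen for illustration. In particular, the summary does not say how object difficulty is measured, so the sketch simply assumes some per-object quality score is available. It only shows how the two layout subtasks feed into an easy/hard split.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Box:
    label: str
    x: float  # left edge, normalized to [0, 1]
    y: float  # top edge, normalized to [0, 1]
    w: float  # width, normalized
    h: float  # height, normalized


def predict_counts(object_counts: dict[str, int]) -> list[str]:
    """Stand-in for the reasoning subtask: one label per object instance."""
    return [name for name, n in object_counts.items() for _ in range(n)]


def predict_boxes(labels: list[str]) -> list[Box]:
    """Stand-in for box prediction: equal-width slots, left to right."""
    if not labels:
        return []
    slot = 1.0 / len(labels)
    return [Box(label, i * slot, 0.25, slot, 0.5) for i, label in enumerate(labels)]


def split_by_difficulty(
    boxes: list[Box],
    score: Callable[[Box], float],
    threshold: float = 0.5,
) -> tuple[list[Box], list[Box]]:
    """Split objects into easy and hard ones using an assumed quality score."""
    easy = [b for b in boxes if score(b) >= threshold]
    hard = [b for b in boxes if score(b) < threshold]
    return easy, hard


# Hypothetical usage: counts would come from a language model in practice.
labels = predict_counts({"dog": 2, "cat": 1})
boxes = predict_boxes(labels)
easy, hard = split_by_difficulty(
    boxes, score=lambda b: 0.9 if b.label == "dog" else 0.2
)
```

In a real pipeline the two stand-in functions would be replaced by language-model calls, and the hard objects would be passed to a second generation step. The split itself is the only part this sketch tries to make concrete.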

DivCon: Divide and Conquer for Progressive Text-to-Image Generation
