DivCon: Divide and Conquer for Progressive Text-to-Image Generation

Yuhao Jia, Wenhan Tan
2024-08-17
Abstract: Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermediary to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical & spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks and our approach outperforms previous state-of-the-art models by notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a limitation of text-to-image (T2I) generation: existing methods struggle to produce high-quality images when the text prompt contains multiple objects and complex spatial relationships. Specifically, existing T2I models perform poorly on prompts that specify object counts, differing sizes, rich details, and complex spatial relationships. The problems fall into three areas:

1. **Insufficient numerical and spatial reasoning ability**: Existing T2I models have difficulty performing numerical and spatial reasoning from complex text prompts, especially those involving multiple objects and complex spatial relationships.
2. **Inconsistent generation quality**: Objects differ in how difficult they are to generate, so existing layout-to-image generation models handle objects with different characteristics unevenly. As a result, some difficult objects are poorly synthesized in the image.
3. **High computational cost**: Some methods generate each object separately through multiple forward passes, so the computational cost grows significantly as the number of objects increases.

To address these challenges, the paper proposes DivCon, a method based on a divide-and-conquer strategy. DivCon improves the quality and accuracy of generated images by decomposing the complex generation task into multiple subtasks. Its main contributions are:

- **Divide-and-conquer strategy**: Decomposes the complex layout prediction and image generation tasks into multiple simple subtasks, thereby improving generation quality and accuracy.
- **Layout prediction stage**: Splits layout prediction into two steps, numerical and spatial reasoning followed by bounding box prediction, to parse the component information in the text prompt more accurately.
- **Layout-to-image generation stage**: Splits layout-to-image generation into two steps and progressively generates objects of differing difficulty to achieve higher-quality images.

With these improvements, DivCon's experimental results on the HRS and NSR-1K benchmarks show that it outperforms existing T2I models in numerical and spatial reasoning performance and in the quality of generated images.
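The two-stage decomposition described above can be illustrated with a small Python sketch. It is not the authors' implementation. Every name in it (`LayoutObject`, `reason_counts`, `predict_boxes`, `progressive_rounds`) is hypothetical, the reasoning and box-prediction steps are replaced by trivial stubs where the paper uses a language model, and the difficulty score and threshold are assumptions, since the summary does not say how difficulty is measured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; an assumed box convention.
Box = Tuple[float, float, float, float]


@dataclass
class LayoutObject:
    name: str
    box: Box


def reason_counts(object_counts: Dict[str, int]) -> List[str]:
    """Layout subtask 1 (stub): expand per-object counts into instances.

    A real system would derive the counts and spatial relations from the
    prompt; here they are passed in directly.
    """
    return [name for name, count in object_counts.items() for _ in range(count)]


def predict_boxes(instances: List[str]) -> List[LayoutObject]:
    """Layout subtask 2 (stub): place instances in equal-width columns."""
    n = len(instances)
    return [
        LayoutObject(name, (i / n, 0.25, (i + 1) / n, 0.75))
        for i, name in enumerate(instances)
    ]


def progressive_rounds(
    layout: List[LayoutObject],
    difficulty: Callable[[LayoutObject], float],
    threshold: float,
) -> List[List[LayoutObject]]:
    """Order generation in two rounds: easy objects first, then hard ones.

    `difficulty` and `threshold` are placeholders for whatever criterion
    separates easy objects from difficult ones.
    """
    easy = [obj for obj in layout if difficulty(obj) <= threshold]
    hard = [obj for obj in layout if difficulty(obj) > threshold]
    return [batch for batch in (easy, hard) if batch]


# Usage: two cats and one dog, with the dog assumed harder to generate.
layout = predict_boxes(reason_counts({"cat": 2, "dog": 1}))
rounds = progressive_rounds(
    layout, lambda obj: 0.9 if obj.name == "dog" else 0.1, threshold=0.5
)
for index, batch in enumerate(rounds, start=1):
    print(f"round {index}:", [obj.name for obj in batch])
```

Keeping the three functions separate mirrors the divide-and-conquer idea: each subtask can be replaced by a stronger component (for example a language model for the first two, a layout-conditioned diffusion model consuming each round) without changing the others.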