Abstract: Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of "complex scene" itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist's painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
The paper aims to address the challenges in generating complex scenes. Although existing text-to-image diffusion models perform well in generating high-quality images, they still have significant limitations when dealing with complex scenes. Specifically, when prompts contain multiple entities, complex spatial positions, and conflicting relationships, these models often exhibit issues such as entity omission, spatial inconsistency, and overall disharmony.
To tackle these problems, the authors propose a new training-free diffusion framework called **Complex Diffusion (CxD)**. This framework draws inspiration from the three stages of an artist's painting process: composition, painting, and retouching. Through this approach, CxD can effectively manage and generate complex scene images, producing high-quality, semantically consistent, and visually diverse images even with complex text prompts.
### Main Contributions
1. **Definitions and Standards**: The paper provides a clear experimental definition of complex scenes and introduces the Complex Decomposition Criteria (CDC) to effectively manage complex prompts.
2. **CxD Framework**: Inspired by the artist's creative process, a training-free Complex Diffusion (CxD) framework is proposed, dividing the generation of complex scene images into three stages: composition, painting, and retouching.
3. **Validation and Performance**: Extensive experiments demonstrate that CxD significantly outperforms existing state-of-the-art methods in generating high-quality, consistent, and diverse complex scene images, even when handling complex text prompts.
### Method Overview
1. **Composition and Layout Generation**:
- **Entity Extraction**: Use a large language model (LLM) to extract entities and their attributes from complex scene prompts.
- **Prompt Rewriting**: Recompose the extracted entities and attributes into sub-prompts, ensuring each sub-prompt aligns as closely as possible with the relevant descriptions of the original complex prompt.
- **Prompt Merging or Splitting**: Based on the Complex Decomposition Criteria (CDC), use the LLM to merge or split sub-prompts into simple prompts.
- **Layout Assignment**: Assign layouts to each simple prompt, ensuring the accuracy of the final composition.
2. **Cross-Attention Modulation**:
- **Prompt Batch Processing**: Use complex prompts, simple prompts, and background prompts as inputs to generate different latent representations.
- **Attention Enhancement Modulation**: Adjust the attention mechanism to ensure each region of the latent representation is emphasized, avoiding concept omission and enhancing details.
3. **Retouching with ControlNet**:
- **Detail Enhancement**: Use the entities and attributes extracted by the LLM as details, and retouch the generated images with the ControlNet-tile model to correct defects and add new details.
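
The idea of guiding each simple prompt to its own region can be illustrated with a small, deliberately simplified sketch. It does not reproduce the paper's cross-attention modulation: it swaps in plain mask-based composition, where each layout box is rasterized into a binary mask and the corresponding per-prompt latent overwrites a background canvas inside that mask. The `Region` class, the `box_to_mask` and `compose` functions, the normalized box format, and the toy 4x4 single-channel "latents" are all hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

Grid = List[List[float]]


@dataclass(frozen=True)
class Region:
    """A simple prompt with a layout box in normalized [0, 1] coordinates (assumed format)."""
    prompt: str
    x0: float
    y0: float
    x1: float
    y1: float


def box_to_mask(region: Region, height: int, width: int) -> List[List[int]]:
    """Rasterize a normalized box into a binary mask on a height x width grid."""
    top, bottom = round(region.y0 * height), round(region.y1 * height)
    left, right = round(region.x0 * width), round(region.x1 * width)
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


def compose(background: Grid, region_latents: List[Grid], masks: List[List[List[int]]]) -> Grid:
    """Copy each region's latent into the canvas where its mask is set.

    Regions are applied in order, so a later region wins where boxes overlap.
    Cells covered by no mask keep the background value.
    """
    canvas = [row[:] for row in background]
    for latent, mask in zip(region_latents, masks):
        for r, mask_row in enumerate(mask):
            for c, on in enumerate(mask_row):
                if on:
                    canvas[r][c] = latent[r][c]
    return canvas


# Toy example: two simple prompts assigned to the left and right halves.
regions = [
    Region("a red fox", 0.0, 0.0, 0.5, 1.0),
    Region("a blue lake", 0.5, 0.0, 1.0, 1.0),
]
masks = [box_to_mask(region, 4, 4) for region in regions]
background = [[0.0] * 4 for _ in range(4)]
latents = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
canvas = compose(background, latents, masks)
```

In this toy setup the left half of `canvas` comes from the first prompt's latent and the right half from the second. A real pipeline would operate on multi-channel latents at every denoising step and, as the summary describes, would adjust attention rather than overwrite values directly.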
### Experimental Results
1. **Qualitative Evaluation**: CxD performs excellently in handling high complexity, precise spatial arrangements, and conflicting entities, generating harmonious and visually satisfying images.
2. **Quantitative Experiments**: On the T2I-CompBench benchmark, CxD significantly outperforms existing state-of-the-art methods in both general text-to-image generation and complex scene generation tasks, particularly excelling in object relationships and complex scene tasks.
3. **Ablation Studies**: Comparing the effects of different components confirms the importance of each one in generating complex scene images.
In summary, CxD effectively addresses key issues in complex scene generation by introducing the Complex Decomposition Criteria and a multi-stage generation framework, providing a new solution for generating high-quality complex scene images.