Abstract: Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io

What problem does this paper attempt to address?

The paper addresses the difficulty that text-to-image diffusion models have with complex prompts, particularly those involving numeracy and spatial reasoning. Although existing diffusion models generate realistic and diverse images, they may fail to produce a specified number of objects, to respect negation, or to handle attribute binding and spatial relationships correctly.

To solve these problems, the paper proposes a method called LLM-grounded Diffusion (LMD), which improves the diffusion model's understanding of prompts through two stages:

1. **First Stage**: A pretrained large language model (LLM) generates a scene layout from the user's prompt describing the desired image. The layout consists of captioned bounding boxes, along with a simple prompt describing the background and an optional negative prompt (i.e., content that should not appear in the generated image).
2. **Second Stage**: A novel controller guides an off-the-shelf diffusion model to generate an image grounded in that layout, so that the result matches the layout produced in the first stage.

Neither stage requires additional model parameter optimization, which makes the method applicable to various existing LLMs and diffusion models. Experimental results show that LMD significantly outperforms the base diffusion model and several strong baselines in generating images that accurately follow prompts, doubling generation accuracy on average across four tasks. Additionally, LMD supports instruction-based multi-round scene specification and can handle prompts in languages that the underlying diffusion model does not support, further enhancing its flexibility and practicality.
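To make the first-stage output more concrete, the sketch below shows one way a scene layout (captioned bounding boxes, a background prompt, and an optional negative prompt) could be represented and parsed from an LLM's text response. The three-line response format, the 512-pixel canvas, the `(x, y, width, height)` box convention, and the names `Box`, `SceneLayout`, and `parse_layout` are all assumptions made for illustration; they are not taken from the paper's code. The second stage (the layout-grounded controller) is not sketched here.

```python
import ast
from dataclasses import dataclass
from typing import List

CANVAS_SIZE = 512  # assumed square canvas, in pixels


@dataclass
class Box:
    caption: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


@dataclass
class SceneLayout:
    boxes: List[Box]
    background_prompt: str
    negative_prompt: str = ""


def parse_layout(response: str) -> SceneLayout:
    """Parse a hypothetical 'Key: value' LLM response into a SceneLayout."""
    fields = {}
    for line in response.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()

    # literal_eval only accepts Python literals, so it does not run code.
    raw_boxes = ast.literal_eval(fields["objects"])
    boxes = [Box(caption, *coords) for caption, coords in raw_boxes]

    # Reject boxes that fall outside the assumed canvas.
    for box in boxes:
        if (box.x < 0 or box.y < 0
                or box.x + box.width > CANVAS_SIZE
                or box.y + box.height > CANVAS_SIZE):
            raise ValueError(f"box for {box.caption!r} is outside the canvas")

    return SceneLayout(
        boxes=boxes,
        background_prompt=fields["background prompt"],
        negative_prompt=fields.get("negative prompt", ""),
    )


# Hypothetical response for the prompt "a gray cat and a brown dog in a park".
EXAMPLE_RESPONSE = """
Objects: [('a gray cat', [40, 260, 200, 180]), ('a brown dog', [280, 240, 200, 200])]
Background prompt: a grassy park
Negative prompt:
"""

layout = parse_layout(EXAMPLE_RESPONSE)
for box in layout.boxes:
    print(box)
```

Because the layout is explicit, properties such as object count or left/right placement can be checked directly on the parsed boxes before any image is generated, which is the kind of numeracy and spatial information the summary says plain diffusion models tend to miss.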

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

LLM-grounded Video Diffusion Models

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

In-Context Learning Unlocked for Diffusion Models

DiffusionGPT: LLM-Driven Text-to-Image Generation System

Decoder-Only LLMs Are Better Controllers for Diffusion Models

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

LLMGA: Multimodal Large Language Model based Generation Assistant

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Guiding Text-to-Image Diffusion Model Towards Grounded Generation

GLoD: Composing Global Contexts and Local Details in Image Generation

Create Your World: Lifelong Text-to-Image Diffusion

MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

Unleashing Text-to-Image Diffusion Models for Visual Perception

DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

PromptFix: You Prompt and We Fix the Photo

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models