LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Long Lian, Boyi Li, Adam Yala, Trevor Darrell
2024-03-05
Abstract: Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty that text-to-image diffusion models have with complex prompts, particularly those that require numeracy and spatial reasoning. Although existing diffusion models generate realistic and diverse images, they may fail to generate a specified number of objects, to respect negation, to bind attributes to the correct objects, or to place objects in the stated spatial relationships.

To solve these problems, the paper proposes LLM-grounded Diffusion (LMD), which improves prompt understanding through a two-stage process:

1. **First Stage**: A pretrained large language model (LLM) generates a scene layout from the user's prompt describing the desired image. The layout consists of captioned bounding boxes, along with a simple prompt describing the background and an optional negative prompt (i.e., content that should not appear in the generated image).
2. **Second Stage**: A novel controller guides an off-the-shelf diffusion model to generate an image grounded in that layout, so that the generated image conforms to the layout produced in the first stage.

Neither stage requires additional model parameter optimization; both use existing pretrained models, which makes the method applicable to various existing LLMs and diffusion models. Experimental results show that LMD significantly outperforms the base diffusion model and several strong baselines at generating images that accurately follow the prompt, doubling generation accuracy on average across four tasks. LMD also supports instruction-based multi-round scene specification and can handle prompts in languages that the underlying diffusion model does not support.
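
To make the first-stage output more concrete, the following is a minimal Python sketch of how a layout of captioned bounding boxes, a background prompt, and an optional negative prompt might be represented and checked. It is an illustration, not the authors' implementation: the response text format, the `[x, y, width, height]` box convention, the 512-pixel canvas, and the names `Box`, `Layout`, `parse_layout`, `boxes_in_canvas`, and `count_objects` are all assumptions introduced here.

```python
import ast
from dataclasses import dataclass


@dataclass
class Box:
    """One captioned bounding box (hypothetical [x, y, width, height] convention)."""
    caption: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class Layout:
    """First-stage output: boxes, a background prompt, an optional negative prompt."""
    boxes: list
    background_prompt: str
    negative_prompt: str = ""


def parse_layout(response: str) -> Layout:
    """Parse an LLM response written in the assumed three-line text format."""
    fields = {}
    for line in response.strip().splitlines():
        # Split on the first colon only, so colons inside values are preserved.
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    # literal_eval only accepts Python literals, so it is safe on untrusted text.
    raw_objects = ast.literal_eval(fields["objects"])
    boxes = [Box(caption, *coords) for caption, coords in raw_objects]
    return Layout(boxes, fields["background prompt"], fields.get("negative prompt", ""))


def boxes_in_canvas(layout: Layout, size: int = 512) -> bool:
    """Check that every box has positive area and lies inside a size x size canvas."""
    return all(
        b.width > 0 and b.height > 0
        and b.x >= 0 and b.y >= 0
        and b.x + b.width <= size and b.y + b.height <= size
        for b in layout.boxes
    )


def count_objects(layout: Layout, keyword: str) -> int:
    """Count boxes whose caption mentions the keyword (a simple numeracy check)."""
    return sum(keyword in b.caption for b in layout.boxes)


# Illustrative response for a prompt such as "two gray cats in a living room, no dogs".
response = """Objects: [('a gray cat', [60, 250, 180, 200]), ('a gray cat', [280, 250, 180, 200])]
Background prompt: A realistic photo of a living room
Negative prompt: dog"""

layout = parse_layout(response)
print(len(layout.boxes), boxes_in_canvas(layout), count_objects(layout, "cat"))
```

In this sketch, the explicit box list is what lets a requested object count or a negated object be verified before any image is generated; the second-stage controller would then consume such a layout, and is not modeled here.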