LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

2024-02-26

Abstract: Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper attempts to address the challenges faced by current text-to-image generation techniques based on diffusion models when dealing with long and detailed text prompts. Specifically, while existing models excel at generating images from short, single-object descriptions, they often fail to faithfully capture all details when handling long text prompts that describe complex scenes with multiple objects. This can result in missing objects, inaccurate positioning, or generated objects that do not match the text description. To solve these issues, the paper proposes a new method that leverages large language models (LLMs) to extract key components from long text prompts, including the bounding box coordinates of foreground objects, detailed text descriptions of each object, and concise background context. These components form the foundation of a layout-to-image generation model, which operates in two stages: global scene generation and iterative refinement. Through this approach, the paper aims to improve the ability to generate images from long text prompts, ensuring that the final images faithfully reflect the complex text descriptions.
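To make the two-stage idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the LLM output can be held in a simple structure (one background string plus a list of objects, each with a bounding box and a description) and that box-level refinement is a loop that re-scores each box and recomposes the ones that fall below a threshold. The names `ObjectSpec`, `Blueprint`, `refine`, `score`, `recompose`, the box convention, and the default `threshold` and `max_rounds` values are all hypothetical; the summary above does not specify how alignment is measured or how objects are recomposed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical box convention: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ObjectSpec:
    """One foreground object extracted from the long prompt."""
    name: str
    box: Box
    description: str


@dataclass
class Blueprint:
    """Components the LLM is asked to extract from the prompt."""
    background: str
    objects: List[ObjectSpec]


def refine(
    blueprint: Blueprint,
    score: Callable[[ObjectSpec], float],
    recompose: Callable[[ObjectSpec], None],
    threshold: float = 0.5,   # assumed value, for illustration only
    max_rounds: int = 3,      # assumed value, for illustration only
) -> Dict[str, int]:
    """Iteratively re-check every box and recompose poorly aligned objects.

    `score` stands in for whatever measures how well the content of a box
    matches its description; `recompose` stands in for regenerating that
    object in place. Returns how many times each object was recomposed.
    """
    edits = {obj.name: 0 for obj in blueprint.objects}
    for _ in range(max_rounds):
        failing = [obj for obj in blueprint.objects if score(obj) < threshold]
        if not failing:
            break  # every box already matches its description well enough
        for obj in failing:
            recompose(obj)
            edits[obj.name] += 1
    return edits
```

In a real pipeline, `score` and `recompose` would wrap an image-text similarity model and a box-conditioned generator respectively; here they are injected as plain callables so the control flow can be exercised without any model.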

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

LLMGA: Multimodal Large Language Model based Generation Assistant

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

GLoD: Composing Global Contexts and Local Details in Image Generation

Obtaining Favorable Layouts for Multiple Object Generation

Compositional Text-to-Image Generation with Dense Blob Representations

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Create Your World: Lifelong Text-to-Image Diffusion

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Multi-modal Generation via Cross-Modal In-Context Learning

DiffusionGPT: LLM-Driven Text-to-Image Generation System

Object-level Visual Prompts for Compositional Image Generation