Abstract: Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called **LLM4GEN**, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.

### What problems does this paper attempt to solve?

This paper addresses the difficulty that existing text-to-image generation models have with complex and dense prompts. Specifically, these models perform poorly when generating images that involve multiple objects, attribute binding, and long descriptions. To address this, the authors propose a new framework, **LLM4GEN**, which enhances text-to-image diffusion models by leveraging the semantic representations of large language models (LLMs).

#### Main problems:

1. **Processing of complex and dense prompts**: Existing diffusion models have difficulty handling complex prompts involving multiple objects, attribute binding, and long descriptions.
2. **Insufficient semantic understanding**: The representation capabilities of the original text encoders (such as the CLIP text encoder) are limited, making it difficult to accurately capture complex semantic information.
3. **Requirements for training data and computational resources**: Existing methods require a large amount of training data and computational resources to align the representations of LLMs with diffusion models.

### Solutions of LLM4GEN:

1. **Cross-Adapter Module (CAM)**: An efficient cross-adapter module fuses the semantic representations of LLMs with the features of the original text encoder, thereby enhancing the semantic understanding of text-to-image generation.
2. **Entity-guided regularization loss**: A new loss function corrects the entity-attribute relationships in text prompts, improving the consistency and accuracy of the generated images.
3. **DensePrompts benchmark**: A new benchmark containing 7,000 dense prompts is created to comprehensively evaluate text-to-image generation.

Through these improvements, LLM4GEN significantly improves generated image quality and text-image alignment, and outperforms existing models in sample quality and human evaluation.

### Summary of mathematical formulas:

- **Cross-attention mechanism**, where \(c_l\) denotes the LLM features and \(c_t\) the features of the original text encoder:
  \[ Q = W_q(c_l), \quad K = W_k(c_t), \quad V = W_v(c_t) \]
  \[ c'_l = \text{CrossAttention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d}}\right) \cdot V \]
  \[ x = \lambda \cdot CA(x, c_l) + CA(x, c_t) \]
- **Entity-guided regularization loss**:
  \[ L_{\text{reg}} = \frac{1}{N \cdot L} \sum_{i=1}^{N} \sum_{l=1}^{L} \|A_i^a - A_i^o\|_2^2 \]
- **Total training loss**:
  \[ L = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\right] + \alpha \cdot L_{\text{reg}} \]

These formulas show how the semantic representations of the LLM are fused with those of the original text encoder, and how the regularization term is added to the standard diffusion training objective.
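The cross-attention formulas above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. It assumes that `W_q`, `W_k`, and `W_v` are plain projection matrices (the summary only writes them as functions), that features are lists of token vectors, and that the two `CA(...)` outputs have already been computed elsewhere. The function names `cross_attention` and `fuse` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(c_l, c_t, w_q, w_k, w_v):
    """Compute softmax(Q K^T / sqrt(d)) V with Q from c_l and K, V from c_t.

    Assumption: the projections are simple matrix multiplications.
    """
    q = matmul(c_l, w_q)
    k = matmul(c_t, w_k)
    v = matmul(c_t, w_v)
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def fuse(ca_llm, ca_text, lam):
    """Combine two cross-attention outputs as lam * ca_llm + ca_text."""
    return [
        [lam * a + b for a, b in zip(row_llm, row_text)]
        for row_llm, row_text in zip(ca_llm, ca_text)
    ]


# Toy example: 2 LLM tokens attend over 3 text-encoder tokens, feature size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
c_l = [[1.0, 0.0], [0.0, 1.0]]
c_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_l_prime = cross_attention(c_l, c_t, identity, identity, identity)
```

Each row of `c_l_prime` is a weighted average of the projected text-encoder tokens, so the output keeps one vector per LLM token. In `fuse`, `lam` plays the role of \(\lambda\) and controls how strongly the LLM branch contributes relative to the original text branch.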

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
