Abstract: Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, how best to use current advanced LLMs in text-to-image diffusion models remains underexplored. We observe an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades prompt-following ability in image generation. We identify two main obstacles behind this issue. One is the misalignment between the next-token-prediction training of LLMs and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To address these obstacles, we propose a novel framework to fully harness the capabilities of LLMs. Through carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to flexibly integrate state-of-the-art LLMs into text-to-image generation models. We also provide an effective way to fuse multiple LLMs into our framework. Given the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model sizes and data sizes. Benefiting from the inherent ability of LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available through the online platform and API after further optimization and security checks.

Decoder-Only LLMs Are Better Controllers for Diffusion Models

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Emage: Non-Autoregressive Text-to-Image Generation

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Create Your World: Lifelong Text-to-Image Diffusion

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

DiffusionGPT: LLM-Driven Text-to-Image Generation System

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

Unleashing Text-to-Image Diffusion Models for Visual Perception

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Diffusion-Adapter: Text Guided Image Manipulation with Frozen Diffusion Models

LLM-grounded Video Diffusion Models

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

De-Diffusion Makes Text a Strong Cross-Modal Interface

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception