Abstract: Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
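
The idea of routing each sampling step to a stage-specific denoiser can be illustrated with a minimal Python sketch. This is not the eDiff-I implementation: the class `ExpertEnsemble`, the function `sample`, the noise-level boundaries `0.3` and `0.7`, and the toy denoisers are all hypothetical names and values chosen for illustration. The sketch only shows that exactly one expert runs per step, which is why an ensemble of this kind need not cost more at inference time than a single shared model.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence, Tuple

# A denoiser maps (noisy sample, noise level t, text prompt) to a cleaner sample.
Denoiser = Callable[[List[float], float, str], List[float]]


class ExpertEnsemble:
    """Routes each denoising step to the expert that owns its noise-level interval.

    Hypothetical illustration: `boundaries` are ascending noise levels in (0, 1),
    so n boundaries define n + 1 intervals, each handled by one expert.
    """

    def __init__(self, boundaries: Sequence[float], experts: Sequence[Denoiser]):
        if len(experts) != len(boundaries) + 1:
            raise ValueError("need exactly one more expert than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted in ascending order")
        self.boundaries = list(boundaries)
        self.experts = list(experts)

    def expert_index(self, t: float) -> int:
        # Index 0 is the low-noise (late sampling) expert,
        # the last index is the high-noise (early sampling) expert.
        return bisect_right(self.boundaries, t)

    def __call__(self, x: List[float], t: float, prompt: str) -> List[float]:
        # Only one expert is evaluated per step.
        return self.experts[self.expert_index(t)](x, t, prompt)


def make_toy_expert(scale: float) -> Denoiser:
    # Stand-in for a trained network: it just shrinks the sample.
    return lambda x, t, prompt: [v * scale for v in x]


def sample(ensemble: ExpertEnsemble, x: List[float], prompt: str,
           num_steps: int = 10) -> Tuple[List[float], List[int]]:
    """Runs a toy sampling loop from high noise (t = 1) toward low noise."""
    used = []  # which expert handled each step, for inspection
    for i in range(num_steps, 0, -1):
        t = i / num_steps
        used.append(ensemble.expert_index(t))
        x = ensemble(x, t, prompt)
    return x, used


ensemble = ExpertEnsemble(
    boundaries=[0.3, 0.7],  # assumed split points, not taken from the paper
    experts=[make_toy_expert(0.9), make_toy_expert(0.8), make_toy_expert(0.7)],
)
final, used = sample(ensemble, [1.0, -1.0], "a cat wearing a hat")
```

Here `used` records one expert index per sampling step, starting with the high-noise expert and ending with the low-noise one. In a real system, each expert would be a full denoising network fine-tuned from a shared base model on its own noise range, as the abstract describes.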

RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation

RWKV-CLIP: A Robust Vision-Language Representation Learner

Implicit and Explicit Language Guidance for Diffusion-based Visual Perception

Large-scale Reinforcement Learning for Diffusion Models

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Speech Guided Disentangled Visual Representation Learning for Lip Reading

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

Diffusion Feedback Helps CLIP See Better

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Are Diffusion Models Vision-And-Language Reasoners?

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Prompting Visual-Language Models for Dynamic Facial Expression Recognition

Denoising Autoregressive Representation Learning

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Do text-free diffusion models learn discriminative visual representations?

Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Unified Discrete Diffusion for Simultaneous Vision-Language Generation