Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Juntu Zhao, Junyu Deng, Yixin Ye, Chongxuan Li, Zhijie Deng, Dequan Wang

2024-08-05

Abstract: Advancements in text-to-image diffusion models have broadened extensive downstream practical applications, but such models often encounter misalignment issues between text and image. Taking the generation of a combination of two disentangled concepts as an example, say given the prompt "a tea cup of iced coke", existing models usually generate a glass cup of iced coke because the iced coke usually co-occurs with the glass cup instead of the tea one during model training. The root of such misalignment is attributed to the confusion in the latent semantic space of text-to-image diffusion models, and hence we refer to the "a tea cup of iced coke" phenomenon as Latent Concept Misalignment (LC-Mis). We leverage large language models (LLMs) to thoroughly investigate the scope of LC-Mis, and develop an automated pipeline for aligning the latent semantics of diffusion models to text prompts. Empirical assessments confirm the effectiveness of our approach, substantially reducing LC-Mis errors and enhancing the robustness and versatility of text-to-image diffusion models. The code and dataset are here: https://github.com/RossoneriZhao/iced_coke.

Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses Latent Concept Misalignment (LC-Mis) in text-to-image diffusion models. The problem appears when a model must generate a combination of two separate concepts. Given the prompt "a tea cup of iced coke", for example, existing models usually generate a glass cup of iced coke, because iced coke usually co-occurs with a glass cup rather than a tea cup in the training data. The authors attribute this misalignment to confusion in the latent semantic space of text-to-image diffusion models. They use large language models (LLMs) to investigate the scope of LC-Mis and develop an automated pipeline that aligns the latent semantics of the diffusion model with the text prompt.

The main contributions of the paper are:
1. Investigating the overlooked LC-Mis problem in existing text-to-image diffusion models and introducing an LLM-based dataset collection pipeline.
2. Proposing a method that splits the concepts in the text prompt and inputs them at different stages of the diffusion generation process, which alleviates the LC-Mis problem.

According to the paper's empirical assessments, these methods substantially reduce LC-Mis errors and improve the robustness and versatility of text-to-image diffusion models.
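The idea of feeding different concepts at different stages of generation can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function name `staged_prompt_schedule`, the `switch_fraction` parameter, the choice of which prompt goes first, and the single hard switch point are all assumptions made for illustration. The sketch only decides which prompt conditions each denoising step; a real sampler would encode the chosen prompt and pass it to the denoiser at that step.

```python
from typing import List


def staged_prompt_schedule(
    first_prompt: str,
    second_prompt: str,
    num_steps: int,
    switch_fraction: float = 0.5,
) -> List[str]:
    """Return the prompt that conditions each denoising step.

    Hypothetical helper: the early steps use `first_prompt`, the remaining
    steps use `second_prompt`. The switch point is an assumed hyperparameter,
    not a value taken from the paper.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")

    # Index of the first step that uses the second prompt.
    switch_step = int(round(num_steps * switch_fraction))
    return [
        first_prompt if step < switch_step else second_prompt
        for step in range(num_steps)
    ]


# Assumed ordering: establish the easily lost concept first, then the full prompt.
schedule = staged_prompt_schedule(
    "a tea cup",
    "a tea cup of iced coke",
    num_steps=10,
    switch_fraction=0.3,
)
for step, prompt in enumerate(schedule):
    print(step, prompt)
```

With these illustrative settings, the first three steps are conditioned on the container alone and the remaining seven on the full prompt. Where the switch should occur, and whether it should be adapted per prompt, are design choices that this summary does not specify.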

RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Create Your World: Lifelong Text-to-Image Diffusion

Ablating Concepts in Text-to-Image Diffusion Models

The Hidden Language of Diffusion Models

Editing Massive Concepts in Text-to-Image Diffusion Models

Continuous Concepts Removal in Text-to-image Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

Non-confusing Generation of Customized Concepts in Diffusion Models

Implicit Concept Removal of Diffusion Models

Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

MagicMix: Semantic Mixing with Diffusion Models

Explore In-Context Segmentation via Latent Diffusion Models