Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Juntu Zhao, Junyu Deng, Yixin Ye, Chongxuan Li, Zhijie Deng, Dequan Wang
2024-08-05
Abstract: Advancements in text-to-image diffusion models have broadened extensive downstream practical applications, but such models often encounter misalignment issues between text and image. Taking the generation of a combination of two disentangled concepts as an example, say given the prompt "a tea cup of iced coke", existing models usually generate a glass cup of iced coke because the iced coke usually co-occurs with the glass cup instead of the tea one during model training. The root of such misalignment is attributed to the confusion in the latent semantic space of text-to-image diffusion models, and hence we refer to the "a tea cup of iced coke" phenomenon as Latent Concept Misalignment (LC-Mis). We leverage large language models (LLMs) to thoroughly investigate the scope of LC-Mis, and develop an automated pipeline for aligning the latent semantics of diffusion models to text prompts. Empirical assessments confirm the effectiveness of our approach, substantially reducing LC-Mis errors and enhancing the robustness and versatility of text-to-image diffusion models. The code and dataset are here: https://github.com/RossoneriZhao/iced_coke.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses Latent Concept Misalignment (LC-Mis) in text-to-image diffusion models. When a model is asked to generate a combination of two separate concepts, for example with the prompt "a tea cup of iced coke", existing models usually generate a glass cup of iced coke, because iced coke usually co-occurs with a glass cup rather than a tea cup in the training data. The authors attribute this misalignment to confusion in the latent semantic space of text-to-image diffusion models. They name the phenomenon LC-Mis, use large language models (LLMs) to investigate its scope, and develop an automated pipeline that aligns the latent semantics of the diffusion model with the text prompt.

The main contributions of the paper are:
1. Investigating the overlooked problem of Latent Concept Misalignment (LC-Mis) in existing text-to-image diffusion models and introducing an LLM-based dataset collection pipeline.
2. Proposing a method that splits the concepts in the text prompt and feeds them to the diffusion model at different stages of the generation process, which alleviates the LC-Mis problem.

According to the paper's empirical assessments, these methods substantially reduce LC-Mis errors and improve the robustness and versatility of text-to-image diffusion models.
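
The idea of feeding split concepts to the model at different stages of generation can be pictured as a per-step prompt schedule. The sketch below is only an illustration of that idea, not the authors' implementation: the function name `prompt_schedule`, the `switch_fraction` parameter, the choice of which prompt comes first, and the example prompts are all assumptions made here. The paper and its repository define the actual method.

```python
from typing import List


def prompt_schedule(
    early_prompt: str,
    late_prompt: str,
    num_steps: int,
    switch_fraction: float = 0.5,
) -> List[str]:
    """Return the text prompt to condition on at each denoising step.

    Hypothetical helper: the first `switch_fraction` of the steps use
    `early_prompt`, the remaining steps use `late_prompt`. The switch point
    and the ordering of the concepts are assumptions, not values from the paper.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must be in [0, 1]")

    # Index of the first step that uses the late prompt.
    switch_step = round(num_steps * switch_fraction)
    return [
        early_prompt if step < switch_step else late_prompt
        for step in range(num_steps)
    ]


# Illustrative prompts only: condition on the container concept first,
# then on the full prompt for the remaining steps.
schedule = prompt_schedule(
    early_prompt="a tea cup",
    late_prompt="a tea cup of iced coke",
    num_steps=10,
    switch_fraction=0.3,
)
for step, prompt in enumerate(schedule):
    print(step, prompt)
```

In a real sampler, each entry of the schedule would be encoded by the text encoder and passed as the conditioning for the corresponding denoising step. Keeping the schedule separate from the sampler makes it easy to try different switch points.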