Abstract: Customized text-to-image generation, which aims to learn user-specified concepts from a few images, has drawn significant attention recently. However, existing methods usually suffer from overfitting and entangle subject-unrelated information (e.g., background and pose) with the learned concept, limiting the potential to compose the concept into new scenes. To address these issues, we propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation. Unlike conventional methods that learn a single concept embedding from the given images, DETEX represents each image using multiple word embeddings during training, i.e., a learnable image-shared subject embedding and several image-specific subject-unrelated embeddings. To decouple irrelevant attributes (i.e., background and pose) from the subject embedding, we further present several attribute mappers that encode each image as several image-specific subject-unrelated embeddings. To encourage these unrelated embeddings to capture the irrelevant information, we incorporate them with the corresponding attribute words and propose a joint training strategy to facilitate the disentanglement. During inference, we use only the subject embedding for image generation, while selectively using image-specific embeddings to retain image-specific attributes. Extensive experiments demonstrate that the subject embedding obtained by our method can faithfully represent the target concept, while showing superior editability compared to state-of-the-art methods. Our code will be made publicly available.

What problem does this paper attempt to address?

The paper addresses a weakness of customized text-to-image generation: existing methods tend to overfit and to entangle subject-unrelated information (such as background and pose) with the learned concept, which limits how well the concept can be composed into new scenes.

Specifically, the paper points out that when existing methods learn a user-specified concept from a small number of images (usually 3-5), the limited training samples mean the learned concept inevitably absorbs subject-unrelated information, such as the image background and the pose and position of the subject, which reduces editability. Some studies try to filter out background information using subject masks, but the effect is not ideal because irrelevant factors (such as blank backgrounds and poses) remain intertwined with the learned concept.

To solve these problems, the paper proposes DETEX (Decoupled Textual Embeddings for Customized Image Generation), a method that learns decoupled concept embeddings for flexible customized text-to-image generation. The main innovations of DETEX are as follows:

1. **Decoupled concept embeddings**: Unlike conventional methods, DETEX does not learn a single concept embedding from the given images. It represents each image with multiple word embeddings: a learnable subject embedding shared across images and several image-specific subject-unrelated embeddings. This helps decouple irrelevant attributes (such as background and pose) from the subject embedding.
2. **Attribute mappers**: To further decouple irrelevant attributes, the paper introduces several attribute mappers that encode each image into several image-specific subject-unrelated embeddings. These embeddings are combined with the corresponding attribute words (such as "background" and "pose"), and a joint training strategy is proposed to promote the decoupling.
3. **Selective use of embeddings**: At inference, DETEX uses only the subject embedding for image generation and can selectively add image-specific embeddings to retain image-specific attributes. This design lets the generated image be both faithful to the target concept and highly editable.

Through these techniques, DETEX decouples subject-related from subject-unrelated information, increasing the diversity and editing flexibility of generated images while maintaining subject consistency. The experimental results reported in the paper show that DETEX outperforms existing state-of-the-art methods in both concept fidelity and editing flexibility.
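The embedding scheme described above can be illustrated with a small toy sketch. It is not the authors' implementation: the prompt template, the placeholder token names, the linear form of the mapper, and the helper names `make_linear_mapper` and `build_prompt_tokens` are all assumptions made for illustration. It only shows how one shared subject embedding and per-image attribute embeddings might be slotted into a prompt during training, and how the attribute embeddings can be dropped or kept selectively at inference.

```python
import random

EMBED_DIM = 4  # toy size (assumption); real text-encoder embeddings are far larger


def make_linear_mapper(in_dim, out_dim, seed):
    """Hypothetical attribute mapper: a fixed random linear map from an
    image feature vector to a word-embedding-sized vector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def mapper(image_feature):
        return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]

    return mapper


def build_prompt_tokens(subject_embedding, attribute_embeddings=None):
    """Assemble (token, embedding) pairs for a prompt.

    Ordinary words carry None as a stand-in for their frozen vocabulary
    embedding; placeholder tokens carry the learned or mapped embedding.
    """
    tokens = [("a", None), ("photo", None), ("of", None),
              ("<subject>", subject_embedding)]
    for attribute_word, embedding in (attribute_embeddings or {}).items():
        # Pair each image-specific embedding with its attribute word.
        tokens += [("with", None), (f"<{attribute_word}>", embedding),
                   (attribute_word, None)]
    return tokens


# One subject embedding shared by all training images (would be learnable).
subject_embedding = [0.0] * EMBED_DIM

# One mapper per subject-unrelated attribute.
mappers = {
    "background": make_linear_mapper(3, EMBED_DIM, seed=0),
    "pose": make_linear_mapper(3, EMBED_DIM, seed=1),
}

# Made-up feature vector standing in for one encoded training image.
image_feature = [0.2, -0.5, 1.0]
attribute_embeddings = {name: m(image_feature) for name, m in mappers.items()}

# Training: subject embedding plus every image-specific attribute embedding.
training_prompt = build_prompt_tokens(subject_embedding, attribute_embeddings)

# Inference: subject only, or subject plus a selected attribute to retain.
inference_prompt = build_prompt_tokens(subject_embedding)
keep_pose_prompt = build_prompt_tokens(
    subject_embedding, {"pose": attribute_embeddings["pose"]})

print([token for token, _ in training_prompt])
print([token for token, _ in inference_prompt])
```

In this sketch the subject-only prompt leaves background and pose free to be set by the rest of the text, while `keep_pose_prompt` reuses one image's pose embedding, which mirrors the selective use of embeddings at inference described in item 3.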

Decoupled Textual Embeddings for Customized Image Generation

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Attention Calibration for Disentangled Text-to-Image Personalization

DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation

CusConcept: Customized Visual Concept Decomposition with Diffusion Models

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

RealCustom++: Representing Images as Real-Word for Real-Time Customization

AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

Disentangling for Text-to-Image Generation

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Customized Generation Reimagined: Fidelity and Editability Harmonized

Multi-Concept Customization of Text-to-Image Diffusion

Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition