Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation

Yang Li, Songlin Yang, Wei Wang, Jing Dong
2024-03-22
Abstract: Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes ("Eiffel Tower"), actions ("holding a basketball"), and facial attributes ("eyes closed"). In this paper, we focus on inserting accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses personalized generation in Text-to-Image (T2I) models: how to insert new identity (ID) information into a pre-trained T2I model so that it produces high-quality, high-fidelity personalized images. It targets two key problems:

1. **Attention Overfit**:
   - Previous methods (such as Textual Inversion and ProSpect) tend to fit the entire target image during fine-tuning, not just the face region related to the identity. As a result, the ID embedding absorbs ID-unrelated information such as the background, which reduces identity accuracy and makes it difficult to combine the ID with other concepts (e.g., scenes and actions).
2. **Limited Semantic-Fidelity**:
   - Although some methods (like Celeb Basis) improve the accuracy of the identity mapping, they introduce too much facial prior knowledge, which limits control over facial attributes such as expression changes.

To address these issues, the paper proposes two main technical contributions:

1. **Face-Wise Attention Loss**:
   - After visualizing the attention overfit problem in previous methods, the paper proposes an attention loss that constrains the ID embedding to fit the face region rather than the entire image. According to the authors, this significantly improves ID accuracy and the ability to generate the ID interactively with other existing concepts (such as scenes and actions).
2. **Semantic-Fidelity Token Optimization**:
   - The paper optimizes one ID representation as multiple per-stage tokens, with each token containing two disentangled features. This expands the textual conditioning space and improves control over scenes, facial attributes, and actions.
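The face-wise attention loss described above can be illustrated with a small sketch. The summary does not give the paper's exact formula, so this assumes a mean squared error between the peak-normalized cross-attention map of the ID token and a binary face mask; the function name `face_attention_loss` and the list-of-lists representation are hypothetical choices made for illustration only.

```python
def face_attention_loss(attn_map, face_mask):
    """Illustrative loss: MSE between a normalized attention map and a face mask.

    attn_map  : 2D list of non-negative floats, the ID token's attention
                weight at each spatial location (assumed representation).
    face_mask : 2D list of 0/1 values with the same shape, 1 inside the face.
    """
    same_rows = len(attn_map) == len(face_mask)
    if not same_rows or any(len(a) != len(m) for a, m in zip(attn_map, face_mask)):
        raise ValueError("attention map and mask must have the same shape")

    # Scale the map so its peak is 1, making it comparable to a 0/1 mask.
    peak = max(max(row) for row in attn_map)
    scale = peak if peak > 0 else 1.0

    total, count = 0.0, 0
    for a_row, m_row in zip(attn_map, face_mask):
        for a, m in zip(a_row, m_row):
            total += (a / scale - m) ** 2
            count += 1
    return total / count


# Toy 3x3 example: the "face" is the single center cell.
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
focused = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]  # attends to the face only
overfit = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]  # attends to the whole image

focused_loss = face_attention_loss(focused, mask)
overfit_loss = face_attention_loss(overfit, mask)
```

In this toy setup, an attention map concentrated on the face region is penalized less than one spread over the whole image, which is the behavior the loss is meant to encourage. A real implementation would operate on the cross-attention tensors of the diffusion model and would need a face mask from a segmentation or detection step, neither of which is shown here.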
Through these technical means, the paper aims to achieve more accurate and interactive identity embeddings, capable of generating images with a wider range of scenes, facial attributes, and actions given a prompt, thereby significantly enhancing the quality and flexibility of personalized generation.
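The per-stage token idea can be sketched as a lookup from denoising timestep to a stage-specific token. This is only a structural illustration under stated assumptions: the equal-width split of timesteps into stages, the stage count, the feature dimension, and the names `StageToken`, `feature_a`, `feature_b`, `stage_index`, and `select_token` are all hypothetical, and the sketch does not show how the paper's two disentangled features are defined or used inside the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageToken:
    """One ID token for one stage, holding two separate feature vectors."""
    feature_a: List[float]  # hypothetical name for the first feature
    feature_b: List[float]  # hypothetical name for the second feature


def stage_index(timestep: int, total_steps: int, num_stages: int) -> int:
    """Map a timestep in [0, total_steps) to one of num_stages equal-width stages."""
    if not 0 <= timestep < total_steps:
        raise ValueError("timestep out of range")
    return timestep * num_stages // total_steps


def make_stage_tokens(num_stages: int, dim: int) -> List[StageToken]:
    """Create one independently optimizable token per stage (zero-initialized here)."""
    return [StageToken([0.0] * dim, [0.0] * dim) for _ in range(num_stages)]


def select_token(tokens: List[StageToken], timestep: int, total_steps: int) -> StageToken:
    """Pick the token that conditions the model at the given timestep."""
    return tokens[stage_index(timestep, total_steps, len(tokens))]


# Assumed values for illustration: 10 stages over 1000 timesteps, 4-dim features.
tokens = make_stage_tokens(num_stages=10, dim=4)
token_at_250 = select_token(tokens, timestep=250, total_steps=1000)
```

Because each stage owns its own token, optimizing one stage's features leaves the others untouched, which is one way to read the summary's claim that splitting a single ID representation across stages expands the textual conditioning space.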