Abstract: Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes ("Eiffel Tower"), actions ("holding a basketball"), and facial attributes ("eyes closed"). In this paper, we focus on inserting accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.

What problem does this paper attempt to address?

The paper addresses personalized generation in Text-to-Image (T2I) models: how to insert new identity (ID) information into a pre-trained T2I model so that it generates accurate, high-fidelity images of that identity. It targets two key problems:

1. **Attention Overfit**:
   - Previous methods (such as Textual Inversion and ProSpect) tend to fit the entire target image during fine-tuning, not just the face region that carries the identity. The resulting ID embedding absorbs ID-unrelated information such as face layout and background, which reduces identity accuracy and makes it hard to combine the identity with other concepts (e.g., scenes and actions).
2. **Limited Semantic-Fidelity**:
   - Some methods (like Celeb Basis) improve the accuracy of identity mapping, but they introduce too much facial prior knowledge, which limits control over facial attributes such as expression changes.

To address these issues, the paper proposes two main technical contributions:

1. **Face-Wise Attention Loss**:
   - After visualizing the attention overfit problem in previous methods, the paper proposes an attention loss that constrains attention to the face region rather than the entire image. This improves identity accuracy and strengthens interactive generation with other existing concepts (such as scenes and actions).
2. **Semantic-Fidelity Token Optimization**:
   - The paper optimizes one identity representation as multiple per-stage tokens, with each token containing two disentangled features. This expands the textual conditioning space and improves control over scenes, facial attributes, and actions.

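
The idea behind the face-wise attention loss can be illustrated with a small sketch. The summary above does not give the loss formula, so everything concrete here is an assumption: the binary face mask, the peak normalization, the mean-squared-error penalty, and the function names `normalize` and `face_attention_loss` are all hypothetical. The sketch only shows why penalizing attention outside the face region discourages fitting the background.

```python
from typing import List

Matrix = List[List[float]]


def normalize(attn: Matrix) -> Matrix:
    """Scale an attention map so its largest value is 1 (all-zero maps are copied unchanged)."""
    peak = max(max(row) for row in attn)
    if peak == 0:
        return [row[:] for row in attn]
    return [[v / peak for v in row] for row in attn]


def face_attention_loss(attn: Matrix, face_mask: Matrix) -> float:
    """Mean squared error between the normalized attention map and a 0/1 face mask.

    Assumption: MSE against a binary mask is an illustrative choice,
    not the formula used in the paper.
    """
    scaled = normalize(attn)
    total, count = 0.0, 0
    for a_row, m_row in zip(scaled, face_mask):
        for a, m in zip(a_row, m_row):
            total += (a - m) ** 2
            count += 1
    return total / count


# Toy 4x4 example: the face occupies the central 2x2 block.
face_mask = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Attention concentrated on the face region.
focused = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.9, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Attention spread uniformly over the whole image (the "overfit" case).
overfit = [[0.5] * 4 for _ in range(4)]

loss_focused = face_attention_loss(focused, face_mask)
loss_overfit = face_attention_loss(overfit, face_mask)
```

In this toy setup the uniformly spread map receives a much larger loss than the face-focused one, so minimizing the loss would push the ID token's attention toward the face.
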
Through these technical means, the paper aims to achieve more accurate and interactive identity embeddings, capable of generating images with a wider range of scenes, facial attributes, and actions given a prompt, thereby significantly enhancing the quality and flexibility of personalized generation.
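
The per-stage token idea can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the paper's implementation: the number of stages (5), the number of diffusion timesteps (1000), the equal-width stage split, the embedding size, and the names `StageToken`, `feature_a`, `feature_b`, `stage_index`, and `token_for_timestep` are all hypothetical. It shows only the bookkeeping: one identity is represented by several tokens, each holding two separate feature vectors, and the token used depends on the current denoising timestep.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class StageToken:
    """One learnable token for a denoising stage, holding two separate feature vectors."""
    feature_a: List[float]  # hypothetical name for the first disentangled feature
    feature_b: List[float]  # hypothetical name for the second disentangled feature


def init_stage_tokens(num_stages: int, dim: int, seed: int = 0) -> List[StageToken]:
    """Create one randomly initialized token per stage (values would be optimized later)."""
    rng = random.Random(seed)

    def vec() -> List[float]:
        return [rng.gauss(0.0, 0.02) for _ in range(dim)]

    return [StageToken(vec(), vec()) for _ in range(num_stages)]


def stage_index(timestep: int, total_steps: int, num_stages: int) -> int:
    """Map a timestep in [0, total_steps) to one of num_stages equal-width stages."""
    if not 0 <= timestep < total_steps:
        raise ValueError("timestep out of range")
    return timestep * num_stages // total_steps


TOTAL_STEPS = 1000  # assumed number of diffusion timesteps
NUM_STAGES = 5      # assumed number of stages
tokens = init_stage_tokens(NUM_STAGES, dim=8)


def token_for_timestep(timestep: int) -> StageToken:
    """Pick the token that conditions generation at the given timestep."""
    return tokens[stage_index(timestep, TOTAL_STEPS, NUM_STAGES)]
```

With these assumed settings, timesteps 0 to 199 share the first token, 200 to 399 the second, and so on, so each stage of denoising can learn its own pair of features for the same identity.
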

Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation

Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis

Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

SSIE-Diffusion: Personalized Generative Model for Subject-Specific Image Editing

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

Inserting Anybody in Diffusion Models via Celeb Basis

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Magic-Me: Identity-Specific Video Customized Diffusion

An Improved Method for Personalizing Diffusion Models

Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

StableIdentity: Inserting Anybody into Anywhere at First Sight

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

Create Your World: Lifelong Text-to-Image Diffusion