Abstract: Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image, requiring the model to effectively incorporate both sources of guidance. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. In this study, we uncover key insights into achieving high-quality balances on subject identity preservation and text-following, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, that employs two main components: a carefully crafted subject image encoder based on the pre-trained UNet of the Stable Diffusion model, following a process that balances the two guidances by separating their dominance stage and revisiting certain time steps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data. Demo Page: http://zichengduan.github.io/pages/EZIGen/index.html

What problem does this paper attempt to address?

This paper addresses zero-shot personalized image generation: how to balance the guidance from the text prompt against the guidance from the subject image so that the generated image both preserves the subject's identity and follows the text. Existing methods often fail to capture fine-grained details of the subject image, and they tend to favor one form of guidance (subject identity or text description) during generation, which leaves the output unbalanced between identity preservation and text consistency. To address this, the authors propose EZIGen, which improves the design of the subject image encoder and introduces a decoupled guidance mechanism.

### Main contributions

1. **Improving the design of the subject image encoder**: The authors find that the design of the subject image encoder has a significant impact on identity preservation, and they propose an encoder based on the pre-trained Stable Diffusion UNet. Subject features are extracted by adding noise to the subject image and running the denoising process, which improves the quality of the subject representation.
2. **Decoupling the generation process**: To better balance subject identity and text consistency, the authors divide generation into two stages: a sketch generation stage and an appearance transfer stage. The sketch generation stage produces a rough sketch latent from the text prompt, and the appearance transfer stage injects the encoded subject details into that sketch latent, so the two guidance signals act at separate stages.
3. **Iterative appearance transfer mechanism**: The authors observe that appearance transfer works better when the sketch latent is semantically similar to the subject. They therefore introduce an iterative generation scheme that repeatedly converts the generated image back into an editable noisy latent, improving the appearance transfer step by step.
4. **Extension to personalized image editing**: The authors also extend the method to personalized image editing. By combining object masks with image inversion, they edit specific regions while keeping the background unchanged.

### Experimental results

The experiments show that EZIGen achieves state-of-the-art performance on multiple benchmarks, in particular on personalized image generation with the DreamBench dataset and on personalized image editing with the DreamEdit dataset. The reported metrics include CLIP-T scores, DINO scores, and human evaluation results, all of which indicate that EZIGen strikes the best balance between subject identity preservation and text consistency.

### Summary

With these improvements, EZIGen generates high-quality images while achieving a better balance between subject identity preservation and text consistency, offering a new solution for zero-shot personalized image generation and editing.
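The decoupled, iterative process described under the main contributions can be illustrated with a toy sketch. This is not the authors' implementation: latents are plain lists of floats, a "denoising step" is a simple move toward a target vector, and every name and parameter here (`guided_step`, `renoise`, `generate`, `sketch_steps`, `renoise_level`) is hypothetical. It only shows the control flow: text-only steps first, subject-only steps afterwards, and re-noising the result before repeating the subject-guided steps.

```python
import math
import random


def renoise(latent, level, rng):
    """Blend a latent with Gaussian noise; level=0 keeps it, level=1 is pure noise."""
    return [(1.0 - level) * x + level * rng.gauss(0.0, 1.0) for x in latent]


def guided_step(latent, target, rate=0.3):
    """Toy stand-in for one denoising step: move a fraction of the way toward target."""
    return [x + rate * (t - x) for x, t in zip(latent, target)]


def distance(a, b):
    """Euclidean distance between two latents."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def generate(text_target, subject_features, sketch_steps=6, transfer_steps=4,
             num_iterations=3, renoise_level=0.5, seed=0):
    """Return (sketch_latent, final_latent) for the toy two-stage process."""
    rng = random.Random(seed)

    # Stage 1 (sketch generation): only the text guidance is active.
    latent = [rng.gauss(0.0, 1.0) for _ in text_target]
    for _ in range(sketch_steps):
        latent = guided_step(latent, text_target)
    sketch = list(latent)

    # Stage 2 (appearance transfer): only the subject guidance is active.
    # After the first pass, the result is pushed back to a noisy, editable
    # latent and the subject-guided steps are run again.
    for iteration in range(num_iterations):
        if iteration > 0:
            latent = renoise(latent, renoise_level, rng)
        for _ in range(transfer_steps):
            latent = guided_step(latent, subject_features)
    return sketch, latent


if __name__ == "__main__":
    text = [1.0, 1.0, 1.0, 1.0]        # hypothetical "text" target
    subject = [-1.0, 0.5, 2.0, 0.0]    # hypothetical "subject" features
    sketch, final = generate(text, subject)
    print("sketch-to-subject distance:", distance(sketch, subject))
    print("final-to-subject distance: ", distance(final, subject))
```

In a real diffusion pipeline the two guided steps would be UNet denoising steps conditioned on text embeddings and on encoded subject features respectively, and re-noising would use the scheduler's noise schedule or an inversion method; the toy version keeps only the staging and the loop structure.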

EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

FaceChain: A Playground for Identity-Preserving Portrait Generation

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

DreamTuner: Single Image is Enough for Subject-Driven Generation

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Imagine yourself: Tuning-Free Personalized Image Generation

Fine-gained Zero-shot Video Sampling

DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning

Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization