EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu
2024-11-24
Abstract: Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image, requiring the model to effectively incorporate both sources of guidance. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. In this study, we uncover key insights into achieving high-quality balances on subject identity preservation and text-following, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, that employs two main components: a carefully crafted subject image encoder based on the pre-trained UNet of the Stable Diffusion model, following a process that balances the two guidances by separating their dominance stage and revisiting certain time steps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data. Demo Page: http://zichengduan.github.io/pages/EZIGen/index.html
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses zero-shot personalized image generation: how to balance the guidance from the text prompt against the guidance from the subject image, so that generated images both preserve the subject's identity and follow the text. Existing methods often fail to capture fine-grained details of the subject image and tend to favor one form of guidance (subject identity or text description) over the other, which leaves the generated image imbalanced between identity preservation and text consistency. To address this, the authors propose EZIGen, which improves the design of the subject image encoder and introduces a decoupled guidance mechanism.

### Main contributions

1. **Improved subject image encoder design**: The authors find that the design of the subject image encoder has a significant impact on identity preservation, and they propose an encoder based on the pre-trained Stable Diffusion UNet. Subject features are extracted by adding noise to the subject image and running it through the denoising process, which improves the quality of the subject representation.
2. **Decoupled generation process**: To better balance subject identity and text consistency, the authors divide generation into two stages: a sketch generation stage and an appearance transfer stage. The sketch generation stage produces a rough sketch latent from the text prompt, while the appearance transfer stage injects the encoded subject details into that sketch latent, thereby separating the two guidance signals.
3. **Iterative appearance transfer mechanism**: The authors observe that appearance transfer works better when the sketch latent is semantically similar to the subject. They therefore introduce an iterative generation scheme that gradually improves appearance transfer by repeatedly converting the generated image back into an editable noisy latent.
4. **Extension to personalized image editing**: The authors also extend the method to personalized image editing. By combining object masks with image inversion, they edit specific regions while keeping the background unchanged.

### Experimental results

The authors report state-of-the-art performance on multiple benchmarks, including personalized image generation on the DreamBench dataset and personalized image editing on the DreamEdit dataset. The metrics used are CLIP-T scores, DINO scores, and human evaluation, all of which are reported to show that EZIGen achieves the best balance between subject identity preservation and text consistency.

### Summary

With these improvements, EZIGen generates high-quality images while achieving a better balance between subject identity preservation and text consistency, offering a new approach to zero-shot personalized image generation and editing.
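
The two-stage, iterative process described under the main contributions can be illustrated with a toy schedule. This is only a sketch under stated assumptions, not the authors' implementation: the function name `build_guidance_schedule`, the parameters `switch_step` and `num_iterations`, and the hard switch between a text-dominated stage and a subject-dominated stage are all hypothetical simplifications. No diffusion model is run; the code only lists which guidance would dominate at each denoising step.

```python
from typing import List, Tuple


def build_guidance_schedule(
    num_steps: int, switch_step: int, num_iterations: int
) -> List[Tuple[int, str]]:
    """Return (timestep, dominant_guidance) pairs for a toy decoupled schedule.

    Timesteps count down from num_steps - 1 (noisiest) to 0 (clean).
    All names and the hard stage switch are illustrative assumptions.
    """
    if not 0 < switch_step < num_steps:
        raise ValueError("switch_step must lie strictly between 0 and num_steps")
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")

    schedule: List[Tuple[int, str]] = []

    # Stage 1 (sketch generation): the text prompt dominates the noisy,
    # early timesteps and fixes the rough layout.
    for t in range(num_steps - 1, switch_step - 1, -1):
        schedule.append((t, "text"))

    # Stage 2 (appearance transfer): subject features dominate the remaining
    # timesteps. Each extra iteration stands for converting the result back
    # to a noisy latent at switch_step and revisiting the same timesteps.
    for _ in range(num_iterations):
        for t in range(switch_step - 1, -1, -1):
            schedule.append((t, "subject"))

    return schedule


schedule = build_guidance_schedule(num_steps=6, switch_step=3, num_iterations=2)
for t, guidance in schedule:
    print(f"t={t}: {guidance}")
```

With `num_iterations=2`, the late timesteps appear twice in the schedule, which mirrors the idea of revisiting certain time steps to improve subject transfer. Where the switch happens and how many iterations are used would be tuning choices in any real system.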