PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan
2023-12-08
Abstract: Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stacked ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications. Our project page is available at https://photo-maker.github.io/
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the challenges in existing personalized image generation methods in balancing efficiency, ID fidelity, and text controllability. Specifically, while current personalized generation methods can produce high-quality facial images, they often require a significant amount of time for fine-tuning and lack diversity and ID fidelity in the generated images. These issues limit the widespread use of these methods in practical applications.

### Main Contributions of the Paper

1. **Efficiency**: A new method named PhotoMaker is proposed, which can generate personalized high-quality facial images in a single forward pass without the need for time-consuming fine-tuning.
2. **ID Fidelity**: By stacking the embeddings of multiple input ID images (stacked ID embedding), PhotoMaker can better preserve the identity information of the input images during the generation process.
3. **Text Controllability**: By combining the stacked ID embedding with text embeddings, PhotoMaker provides stronger text controllability, allowing flexible changes to the attributes of the generated images (such as accessories, expressions, etc.).
4. **Diverse Applications**: In addition to basic attribute modifications, PhotoMaker supports bringing characters from artworks or old photos into reality and mixing features of different identities to generate new personalized IDs.

### Method Overview

- **Stacked ID Embedding**: By stacking the embeddings of multiple input ID images together, a unified ID representation is formed. This embedding can comprehensively represent the features of the input ID and can accept any number of ID images as input.
- **Data Construction Pipeline**: An automated pipeline is designed to construct an ID-oriented dataset containing a large number of ID images with different perspectives, attributes, and scenes to support the training of PhotoMaker.
- **Cross-Attention Mechanism**: Utilizing the cross-attention mechanism in the diffusion model, the identity information in the stacked ID embedding is adaptively fused to generate high-quality personalized images.

### Experimental Results

- **Quantitative Evaluation**: The identity fidelity and text consistency of the generated images are evaluated using metrics such as DINO, CLIP-I, and CLIP-T, showing that PhotoMaker performs excellently on these metrics.
- **Qualitative Evaluation**: Through user studies and visual results, the advantages of PhotoMaker in generating high-quality images, maintaining identity features, and text controllability are demonstrated.
- **Application Demonstrations**: The application effects of PhotoMaker in recontextualization, bringing characters from artworks or old photos into reality, and mixing different identity features are showcased.

In summary, this paper proposes an efficient, high-fidelity, and controllable personalized image generation method, providing new possibilities for practical applications.
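To make the stacked ID embedding and its fusion through cross-attention more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that each ID image has already been encoded into a fixed-size vector, that the stacked embedding is simply concatenated with the text tokens along the sequence axis, and that the attention uses identity projections instead of learned query/key/value weights. The function names `stack_id_embeddings` and `cross_attention` are hypothetical and chosen only for illustration.

```python
import math


def stack_id_embeddings(id_embeddings):
    """Stack N per-image ID vectors into an (N, D) list of rows.

    Assumption: every image was already encoded to a D-dimensional vector.
    N may be any positive number, mirroring "any number of ID images".
    """
    if not id_embeddings:
        raise ValueError("at least one ID embedding is required")
    dim = len(id_embeddings[0])
    if any(len(vec) != dim for vec in id_embeddings):
        raise ValueError("all ID embeddings must share the same dimension")
    return [list(vec) for vec in id_embeddings]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention with identity projections.

    queries: (M, D) image-side features; context: (L, D) conditioning tokens.
    Returns the (M, D) attended outputs and the (M, L) attention weights.
    """
    dim = len(context[0])
    outputs, weights = [], []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        row = softmax(scores)
        attended = [
            sum(row[i] * context[i][d] for i in range(len(context)))
            for d in range(dim)
        ]
        outputs.append(attended)
        weights.append(row)
    return outputs, weights


# Toy usage: one text token plus three stacked ID tokens form the context.
text_tokens = [[0.5, 0.5]]
stacked_ids = stack_id_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context = text_tokens + stacked_ids  # assumed fusion: sequence concatenation
attended, attn = cross_attention([[1.0, 0.0]], context)
```

Because the ID vectors sit side by side in the context rather than being averaged, each query position can weight the individual ID images differently, which is one plausible reading of "adaptively fused" in the summary above.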