StableIdentity: Inserting Anybody into Anywhere at First Sight

Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu
2024-01-29
Abstract: Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity is still an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images for each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editable prior, which is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step to unify image, video, and 3D customized generation models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses several key issues in customized generation, particularly in the one-shot training setting for human faces. It identifies three shortcomings of existing methods:

1. **Stable Facial Identity Preservation**: Existing methods often fail to consistently maintain the identity features of the input face across different contexts.
2. **Flexible Editability**: Even when several images per subject are used during training, existing methods cannot ensure both stable identity preservation and flexible editing.
3. **Efficiency and Practicality**: Some methods require long optimization times, while others need large datasets to train a general encoder, which makes it difficult to capture unique identity details.

To address these issues, the paper proposes StableIdentity, which allows identity-consistent recontextualization using only a single face image. StableIdentity employs a pre-trained face recognition model as a face encoder to capture the identity representation, and uses celebrity names to construct an editable space into which that representation is mapped. The paper also designs a masked two-phase diffusion loss to enhance pixel-level perception of the input face while maintaining the diversity of generation.

In summary, the goal of this research is to improve the flexibility and efficiency of customized generation while keeping identity features stable. Beyond image-level customized generation, the learned identity can be combined with off-the-shelf modules such as ControlNet and injected directly into video and 3D generation models without additional fine-tuning.
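The masked two-phase diffusion loss is only named in the summary above, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that the loss is restricted to a binary face mask, that noisier timesteps use the standard noise-prediction objective (to keep generation diverse), and that less noisy timesteps compare a predicted clean latent with the true one (to sharpen pixel-level perception). The function names, the `split` value, and the use of flat Python lists in place of latent tensors are all hypothetical choices made for illustration.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over the elements where mask is 1 (the face region)."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0


def add_noise(z0, eps, alpha_bar_t):
    """Standard forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [s * z + n * e for z, e in zip(z0, eps)]


def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Invert the forward process to estimate the clean latent from z_t."""
    s, n = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(z - n * e) / s for z, e in zip(z_t, eps_pred)]


def masked_two_phase_loss(z0, eps, eps_pred, mask, t, alpha_bar_t,
                          num_steps=1000, split=0.6):
    """Pick the objective by timestep (assumed split; hypothetical default).

    t >= split * num_steps: masked noise-prediction loss.
    t <  split * num_steps: masked reconstruction loss on the predicted z0.
    """
    if t >= split * num_steps:
        return masked_mse(eps_pred, eps, mask)
    z_t = add_noise(z0, eps, alpha_bar_t)
    z0_hat = predict_z0(z_t, eps_pred, alpha_bar_t)
    return masked_mse(z0_hat, z0, mask)
```

In a real training loop, `eps_pred` would come from the denoising network and `alpha_bar_t` from the noise schedule at timestep `t`; here both are passed in directly so the sketch stays self-contained.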