FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo
2024-03-26
Abstract: This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguishable from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, benefiting from two subtle designs. First, we encode the face identity into a series of feature maps instead of one image token as in prior arts, allowing the model to retain more details of the reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a disentangled integration strategy to balance the text and image guidance during the text-to-image generation process, alleviating the conflict between the reference faces and the text prompts (e.g., personalizing an adult into a "child" or an "elder"). Extensive experimental results demonstrate the effectiveness of our method on various applications, including human image personalization, face swapping under language prompts, making virtual characters into real people, etc. Project Page:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to generate personalized human images that follow text instructions accurately while preserving high-fidelity identity characteristics. It identifies two main problems in existing methods:

1. **Insufficient Facial Detail Preservation**: Encoding the reference face into one or a few tokens discards its spatial representation, so the generated image cannot faithfully preserve the shape and details of the reference face.
2. **Imprecise Language Control**: Existing methods perform poorly when the text prompt conflicts with the reference images. For example, when personalizing an adult as a "child" or an "elderly person", they usually fail to follow the prompt.

To solve these problems, the paper proposes the **FlashFace** method, with two key technical improvements:

1. **Feature Map Encoding**: A reference network encodes the reference image into a series of feature maps instead of a single token, which preserves more facial detail.
2. **Decoupled Integration Strategy**: The control signals from the reference image and from the text prompt are processed separately in the U-Net, which balances the influence of the two and improves the model's ability to follow text instructions.

According to the paper, these improvements make FlashFace effective across several applications, including personalized human image generation, face swapping driven by language prompts, and converting virtual characters into real people.
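
The idea of handling text guidance and reference-face guidance in separate branches can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`attend`, `decoupled_update`), the residual sum, and the `ref_weight` parameter are assumptions made for illustration, and a real U-Net layer would use learned projections and many query positions. The sketch only shows one query vector attending to text tokens and to reference feature-map positions independently, with a scalar that scales the reference contribution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def decoupled_update(query, text_tokens, ref_tokens, ref_weight=1.0):
    """Hypothetical decoupled step: text and reference branches are computed
    separately, then added to the query; ref_weight scales the reference branch."""
    text_out = attend(query, text_tokens, text_tokens)
    # ref_tokens stands in for many spatial positions of a reference feature
    # map, rather than a single pooled image token.
    ref_out = attend(query, ref_tokens, ref_tokens)
    return [q + t + ref_weight * r for q, t, r in zip(query, text_out, ref_out)]

query = [1.0, 0.0]
text_tokens = [[0.0, 2.0]]
ref_tokens = [[4.0, 4.0]]
print(decoupled_update(query, text_tokens, ref_tokens, ref_weight=0.5))
```

In this toy setup, setting `ref_weight` to 0 removes the reference branch entirely, so the text prompt alone drives the update, while larger values pull the result toward the reference features. Keeping the two branches separate is what makes such a knob possible.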