PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai
2024-11-26
Abstract: Current text-to-video (T2V) generation has made significant progress in synthesizing realistic general videos, but identity-specific human video generation with customized ID images remains under-explored. The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following after identity injection. Current video identity customization methods mainly rely on reconstructing the given identity images on text-to-image models, whose distribution diverges from that of the T2V model. This process introduces a tuning-inference gap, leading to dynamic and semantic degradation. To tackle this problem, we propose a novel framework, dubbed PersonalVideo, that applies direct supervision on videos synthesized by the T2V model to bridge the gap. Specifically, we introduce a learnable Isolated Identity Adapter to customize the specific identity non-intrusively, which does not compromise the original T2V model's abilities (e.g., motion dynamics and semantic following). With the non-reconstructive identity loss, we further employ simulated prompt augmentation to reduce overfitting by supervising generated results in more semantic scenarios, gaining good robustness even with only a single reference image available. Extensive experiments demonstrate our method's superiority in delivering high identity faithfulness while preserving the inherent video generation qualities of the original T2V model, outshining prior approaches. Notably, our PersonalVideo seamlessly integrates with pre-trained SD components, such as ControlNet and style LoRA, requiring no extra tuning overhead.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses identity-specific human video customization in text-to-video (T2V) generation: producing videos of a given person with high identity fidelity (ID fidelity) without degrading the motion dynamics or semantic following of the underlying model. Current T2V techniques have made significant progress in synthesizing realistic general videos, but they still struggle to generate videos of a specific identity from customized identity images. The main challenges are:

1. **Maintaining inherent motion dynamics and semantic following**: Existing video customization methods usually learn a customized text-to-image (T2I) prior through image reconstruction supervision during tuning, then inject it into the T2V model at inference to generate videos of the specific identity. However, the distribution difference between the pre-trained T2V model and the pre-trained T2I model creates a tuning-inference gap. This gap causes dynamic and semantic degradation: the generated videos appear static and fail to follow the given prompts.
2. **Inserting consistent high-fidelity identities**: For reconstruction-based video customization, the tuning-inference gap also harms identity fidelity. Because humans are very sensitive to facial features, customized videos need identities that are both high-fidelity and consistent. To keep the identity consistent while preserving the model's dynamics and semantics, traditional reconstruction-based methods often require more images or even additional video inputs to avoid overfitting, which is inconvenient for users.

To solve these problems, the paper proposes a new framework, PersonalVideo, which aims to achieve high identity fidelity from only a small number of identity images while preserving the motion dynamics and semantic following ability of the original T2V model. The framework bridges the tuning-inference gap by applying supervision directly to the videos synthesized by the T2V model, introduces a non-reconstructive identity loss, and adopts simulated prompt augmentation to reduce overfitting, so it remains robust even when only a single reference image is available. In addition, an Isolated Identity Adapter is designed to inject identity information without compromising motion dynamics and semantic following.
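To make the idea of a non-reconstructive identity loss combined with simulated prompt augmentation more concrete, here is a minimal Python sketch. It rests on several assumptions that are not stated in the summary: that identity is compared through face-embedding vectors, that the loss is a mean cosine distance over generated frames, and that augmentation means sampling prompts from a small template pool. The names `identity_loss`, `augment_prompt`, and `SCENE_TEMPLATES` are hypothetical, and the toy vectors stand in for embeddings that a real face-recognition model would produce.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def identity_loss(reference_embedding, frame_embeddings):
    """Mean cosine distance between the reference face embedding and the
    embedding of each generated frame (assumed form of the loss).

    No pixels are reconstructed: only identity features are compared.
    """
    distances = [
        1.0 - cosine_similarity(reference_embedding, frame)
        for frame in frame_embeddings
    ]
    return sum(distances) / len(distances)


# Hypothetical prompt templates; the actual augmentation prompts are not
# given in the summary.
SCENE_TEMPLATES = [
    "{subject} walking on a beach",
    "{subject} reading a book in a library",
    "{subject} playing guitar on a stage",
]


def augment_prompt(subject, rng):
    """Pick a random scene so supervision covers varied semantic contexts."""
    return rng.choice(SCENE_TEMPLATES).format(subject=subject)


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment_prompt("a person", rng))

    # Toy 2-D embeddings: one frame matches the reference, one is orthogonal.
    reference = [1.0, 0.0]
    frames = [[1.0, 0.0], [0.0, 1.0]]
    print(identity_loss(reference, frames))
```

In an actual training loop, the frame embeddings would be computed from videos the T2V model synthesizes for each augmented prompt, and the loss would be backpropagated only into the adapter's parameters; both of those steps are outside this sketch.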