X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo
2024-07-26
Abstract: We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhances the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper proposes an innovative method named **X-Portrait**, aiming to solve the problem of high-fidelity portrait animation generation. Specifically, given a static portrait image and a driving video containing different head poses and facial expressions, X-Portrait can generate high-quality, dynamic, and expressive portrait animations.

#### Main Objectives:

1. **High-Fidelity Portrait Animation**: Generate portrait animations with rich dynamics and detailed facial expressions.
2. **Cross-Domain Adaptability**: Handle various styles of portraits and maintain consistent identity features under different driving videos.
3. **Unsupervised Training**: Achieve zero-shot animation generation through pre-trained diffusion models (e.g., Stable Diffusion).
4. **Fine-Grained Control**: Achieve precise control over head poses and facial expressions through novel control signal design, avoiding traditional coarse explicit control signals (e.g., facial keypoints).

#### Key Technologies:

- **Diffusion Model**: Utilize pre-trained diffusion models as the rendering backbone to achieve high-quality image synthesis.
- **Implicit Motion Control**: Extract motion information directly from the RGB driving video instead of relying on keypoints or skeletons generated by third-party detectors.
- **Local Motion Enhancement**: Introduce an auxiliary ControlNet to enhance attention to local facial movements such as eyes and mouth, improving the realism of the animation.
- **Identity Feature Preservation**: Reduce identity leakage in driving signals through cross-identity training schemes and random scaling operations, ensuring identity consistency in the generated animation.

#### Innovations:

1. **Implicit Motion Control Scheme**: Avoid the limitations of explicit motion representation and dependency on third-party detectors by using cross-identity image generation.
2. **Local Motion Enhancement Module**: Enhance attention to subtle facial movements, improving the expressiveness of the animation.
3. **Cross-Identity Animation Capability**: Directly use driving videos of different identities during the inference stage without additional preprocessing.

Through these technological innovations, X-Portrait demonstrates excellent performance on various portrait styles and complex driving sequences, generating animations with high visual fidelity, identity similarity, and motion accuracy.
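
The random scaling operation mentioned under identity feature preservation can be illustrated with a small sketch. The summary does not give the exact augmentation used in X-Portrait, so everything here is an assumption made for illustration: the function names `scale_about_center` and `random_scale_augment`, the scale range of 0.8 to 1.2, nearest-neighbor sampling, and the use of nested Python lists in place of image tensors. The idea shown is only that the driving frame is rescaled by a random factor before it reaches the motion control module, so that face size and proportions in the driving signal are less reliable as identity cues.

```python
import random


def scale_about_center(image, scale):
    """Rescale a 2D image (list of rows) about its center, keeping canvas size.

    Nearest-neighbor sampling; output pixels whose source falls outside
    the original image are filled with 0.
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for this output pixel.
            sy = round(cy + (y - cy) / scale)
            sx = round(cx + (x - cx) / scale)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out


def random_scale_augment(driving_frame, scale_range=(0.8, 1.2), rng=None):
    """Apply a random center scaling to a driving frame.

    Hypothetical helper: the range (0.8, 1.2) is an illustrative assumption,
    not a value taken from the paper.
    Returns the augmented frame and the scale that was drawn.
    """
    rng = rng or random.Random()
    scale = rng.uniform(*scale_range)
    return scale_about_center(driving_frame, scale), scale


if __name__ == "__main__":
    frame = [[1] * 4 for _ in range(4)]  # toy 4x4 "driving frame"
    augmented, s = random_scale_augment(frame, rng=random.Random(0))
    print(f"scale drawn: {s:.3f}")
    for row in augmented:
        print(row)
```

In an actual training pipeline this step would operate on image tensors with a proper interpolation routine, and the augmented driving frame would be paired with an appearance reference from a different identity, as the cross-identity training scheme describes.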