Abstract: We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhances the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper proposes an innovative method named **X-Portrait**, aiming to solve the problem of high-fidelity portrait animation generation. Specifically, given a static portrait image and a driving video containing different head poses and facial expressions, X-Portrait can generate high-quality, dynamic, and expressive portrait animations.

#### Main Objectives:

1. **High-Fidelity Portrait Animation**: Generate portrait animations with rich dynamics and detailed facial expressions.
2. **Cross-Domain Adaptability**: Handle various styles of portraits and maintain consistent identity features under different driving videos.
3. **Unsupervised Training**: Achieve zero-shot animation generation through pre-trained diffusion models (e.g., Stable Diffusion).
4. **Fine-Grained Control**: Achieve precise control over head poses and facial expressions through novel control signal design, avoiding traditional coarse explicit control signals (e.g., facial keypoints).

#### Key Technologies:

- **Diffusion Model**: Utilize pre-trained diffusion models as the rendering backbone to achieve high-quality image synthesis.
- **Implicit Motion Control**: Extract motion information directly from the RGB driving video instead of relying on keypoints or skeletons generated by third-party detectors.
- **Local Motion Enhancement**: Introduce an auxiliary ControlNet to enhance attention to local facial movements such as eyes and mouth, improving the realism of the animation.
- **Identity Feature Preservation**: Reduce identity leakage in driving signals through cross-identity training schemes and random scaling operations, ensuring identity consistency in the generated animation.

#### Innovations:

1. **Implicit Motion Control Scheme**: Avoid the limitations of explicit motion representation and dependency on third-party detectors by using cross-identity image generation.
2. **Local Motion Enhancement Module**: Enhance attention to subtle facial movements, improving the expressiveness of the animation.
3. **Cross-Identity Animation Capability**: Directly use driving videos of different identities during the inference stage without additional preprocessing.

Through these technological innovations, X-Portrait demonstrates excellent performance on various portrait styles and complex driving sequences, generating animations with high visual fidelity, identity similarity, and motion accuracy.
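
The random scaling used for identity preservation and the patch-based local control can be pictured with a little geometry. The sketch below is illustrative only: the function names, the scale range, and the patch size are assumptions, not values taken from the paper. It shows how a driving frame's size might be randomly rescaled, and how a square patch around a point of interest (for example an eye) might be clamped to the frame.

```python
import random


def random_scale_size(width, height, scale_range=(0.8, 1.2), rng=random):
    """Pick a random scale and return (scale, new_width, new_height).

    Hypothetical helper: the scale range is an assumed placeholder.
    """
    scale = rng.uniform(*scale_range)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return scale, new_w, new_h


def center_fit_offsets(new_w, new_h, width, height):
    """Offsets that center the rescaled frame on the original canvas.

    A negative offset means the rescaled frame is cropped on that axis,
    a positive offset means it is padded.
    """
    return (width - new_w) // 2, (height - new_h) // 2


def local_patch_box(center_x, center_y, patch_size, width, height):
    """Square patch (left, top, right, bottom) clamped inside the frame."""
    if patch_size > width or patch_size > height:
        raise ValueError("patch_size must fit inside the frame")
    half = patch_size // 2
    left = min(max(center_x - half, 0), width - patch_size)
    top = min(max(center_y - half, 0), height - patch_size)
    return left, top, left + patch_size, top + patch_size


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    scale, new_w, new_h = random_scale_size(512, 512, rng=rng)
    print(scale, center_fit_offsets(new_w, new_h, 512, 512))
    # A point near the top-right corner: the patch is shifted back inside.
    print(local_patch_box(500, 20, 64, 512, 512))  # (448, 0, 512, 64)
```

Rescaling the driving frame changes the apparent facial proportions while leaving the expression and pose intact, which is one plausible reading of how scaling augmentation discourages the motion module from copying identity cues.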

X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Performance-Driven Animation of Hand-Drawn Cartoon Faces

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation

AniFaceDiff: Animating Stylized Avatars via Parametric Conditioned Diffusion Models

AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections

MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions