Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://github.com/fudan-generative-vision/hallo3

What problem does this paper attempt to address?

This paper targets three limitations of existing portrait image animation techniques:

1. **Non-frontal views**: Many current facial animation techniques struggle with portraits seen from the side, from above, or from a low angle, because they usually rely on front-facing, centered reference images.
2. **Dynamic object rendering**: It is difficult to generate realistic motion for prominent accessories associated with the portrait, such as handheld smartphones, microphones, or closely worn items.
3. **Static background assumption**: Existing methods often assume a static background, which limits their realism in dynamic scenes, such as a bonfire in the foreground or a crowded street in the background.

To address these challenges, the paper introduces a video generation model built on a pretrained Diffusion Transformer (DiT), applied to portrait image animation for the first time. Because the new video backbone makes earlier U-Net-based methods for identity preservation, audio conditioning, and video extrapolation inapplicable, the paper proposes a replacement for each:

1. **Identity preservation**: An identity reference network, consisting of a causal 3D Variational Autoencoder (3D VAE) combined with a stack of Transformer layers, keeps facial identity consistent across the video sequence.
2. **Speech-audio conditioning**: A cross-attention mechanism and an adaptive layer normalization strategy closely align the speech audio with facial expression dynamics, which allows precise control at inference time.
3. **Video extrapolation**: A long-duration video extrapolation strategy uses motion frames as conditioning information: the last frame of each generated segment serves as input for generating the next segment.

Experiments on benchmark datasets and newly proposed in-the-wild datasets show that the method outperforms previous methods in generating realistic portrait animations with diverse views and dynamic foregrounds and backgrounds.
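The segment-by-segment extrapolation described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: `generate_long_video`, `toy_segment_generator`, and the `num_motion_frames` parameter are hypothetical names, frames are represented by plain integers, and the toy generator stands in for the diffusion transformer. The sketch only shows the control flow, in which the tail of each generated segment conditions the next one.

```python
from typing import Callable, List, Sequence

Frame = int  # hypothetical stand-in; a real system would use image or latent tensors

SegmentGenerator = Callable[[Frame, List[Frame], Sequence[float]], List[Frame]]


def generate_long_video(
    reference: Frame,
    audio_chunks: Sequence[Sequence[float]],
    generate_segment: SegmentGenerator,
    num_motion_frames: int = 1,
) -> List[Frame]:
    """Chain short segments into one long video.

    The last `num_motion_frames` frames of each segment are passed as
    motion frames when generating the following segment.
    """
    video: List[Frame] = []
    motion_frames: List[Frame] = []  # the first segment has no motion frames
    for chunk in audio_chunks:
        segment = generate_segment(reference, motion_frames, chunk)
        video.extend(segment)
        if segment:
            # Keep only the tail of the segment as conditioning for the next one.
            motion_frames = segment[-num_motion_frames:]
    return video


def toy_segment_generator(
    reference: Frame, motion_frames: List[Frame], audio_chunk: Sequence[float]
) -> List[Frame]:
    """Toy model (assumption): emits one frame per audio feature and
    continues counting from the last motion frame, or from the reference."""
    start = motion_frames[-1] + 1 if motion_frames else reference
    return [start + i for i in range(len(audio_chunk))]


audio_chunks = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8, 0.9]]
video = generate_long_video(0, audio_chunks, toy_segment_generator)
print(video)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Because the toy generator continues from the last motion frame, the output has no discontinuity at segment boundaries. That continuity is what motion-frame conditioning is meant to provide in a real model.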

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

FaceChain: A Playground for Identity-Preserving Portrait Generation

PV3D: A 3D Generative Model for Portrait Video Generation

Audio-driven Talking Face Video Generation with Natural Head Pose

Low tissue gastrin content in the ovine distal duodenum is associated with increased percentage of G34.

CapHuman: Capture Your Moments in Parallel Universes

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

Image-to-Video Generation via 3D Facial Dynamics

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections

Animating Portrait Line Drawings from a Single Face Photo and a Speech Signal

Generating Animatable 3D Cartoon Faces from Single Portraits

Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation

JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Video-Driven Neural Physically-Based Facial Asset for Production