Abstract: The field of portrait image animation, driven by speech audio input, has experienced significant advancements in the generation of realistic and dynamic portraits. This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations within the framework of diffusion-based methodologies. Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion. Our proposed network architecture integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities. Through a comprehensive evaluation that incorporates both qualitative and quantitative analyses, our approach demonstrates clear improvements in image and video quality, lip synchronization precision, and motion diversity. Further visualizations and the source code can be found at: https://fudan-generative-vision.github.io/hallo

What problem does this paper attempt to address?

The paper primarily aims to address the following two key issues:

1. **Synchronization and Coordination of Facial Movements Driven by Audio**: How to precisely synchronize facial movements (including lip movements, facial expressions, and head poses) with the input audio signal and ensure coordination among these movements.
2. **Creation of High-Quality and Visually Appealing Animations**: How to generate animations that are both high-fidelity and temporally coherent, making them not only visually pleasing but also accurately reflective of the speaker's characteristics.

To tackle these challenges, the research team proposed an innovative approach called **Hallo**, a hierarchical audio-driven visual synthesis method. This approach leverages the framework of end-to-end diffusion models and introduces a new module to enhance the alignment accuracy between audio inputs and visual outputs (especially lip, expression, and pose movements). Additionally, this method can adaptively control the diversity of expressions and poses, achieving more personalized effects.

Specifically, the study employs the following technical means:

- **End-to-End Diffusion Model**: Generates high-quality dynamic portrait videos directly from image and audio segments without the need for intermediate representations or complex preprocessing steps.
- **Hierarchical Audio-Driven Visual Synthesis Module**: Establishes correspondences between audio and visual features through a cross-attention mechanism, followed by adaptively weighted fusion of these cross-attention results.
- **Network Architecture Integration**: Combines diffusion-based generative models, UNet denoisers, temporal alignment techniques for sequence coherence, and reference networks for guiding visual generation.

Through a series of quantitative and qualitative analyses, the study demonstrates that the proposed method improves image and video quality, lip synchronization accuracy, and motion diversity.
Furthermore, the study emphasizes that its code and sample data will be made available to the open-source community to facilitate further research in the field.
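
To make the hierarchical audio-driven module described above more concrete, the following is a minimal NumPy sketch of the general idea: visual tokens attend over audio tokens via cross-attention, and the result is then blended across lip, expression, and pose regions with normalized weights. It is an illustration under stated assumptions, not the paper's implementation. The function names, tensor sizes, region masks, and scalar weights are all hypothetical, and the weights are fixed here rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, audio, w_q, w_k, w_v):
    """Visual tokens act as queries; audio tokens supply keys and values."""
    q = visual @ w_q                      # (n_visual, d)
    k = audio @ w_k                       # (n_audio, d)
    v = audio @ w_v                       # (n_audio, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)       # each row sums to 1 over audio tokens
    return attn @ v, attn


def hierarchical_fusion(attended, region_masks, region_weights):
    """Mask the attended features per region, then blend the masked copies.

    Assumption: region weights are plain scalars normalized with a softmax.
    In a trained model they would be produced by learned layers instead.
    """
    names = list(region_masks)
    w = softmax(np.array([region_weights[n] for n in names]))
    fused = np.zeros_like(attended)
    for weight, name in zip(w, names):
        fused += weight * region_masks[name][:, None] * attended
    return fused


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_visual, n_audio, d_visual, d_audio, d = 16, 5, 8, 12, 8
visual = rng.normal(size=(n_visual, d_visual))
audio = rng.normal(size=(n_audio, d_audio))
w_q = rng.normal(size=(d_visual, d))
w_k = rng.normal(size=(d_audio, d))
w_v = rng.normal(size=(d_audio, d))

attended, attn = cross_attention(visual, audio, w_q, w_k, w_v)

# Hypothetical nested masks over the 16 visual tokens:
# pose covers every token, expression a subset, lip a smaller subset.
masks = {
    "lip": np.zeros(n_visual),
    "expression": np.zeros(n_visual),
    "pose": np.ones(n_visual),
}
masks["lip"][12:] = 1.0
masks["expression"][4:] = 1.0

weights = {"lip": 1.0, "expression": 0.5, "pose": 0.0}
fused = hierarchical_fusion(attended, masks, weights)
```

Because the normalized weights sum to one, tokens covered by all three masks keep their attended features unchanged, while tokens covered only by the pose mask are scaled by the pose weight alone. Raising one region's weight in this sketch shifts how strongly audio drives that region relative to the others.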

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Audio-Driven Emotional Video Portraits

Audio-driven Talking Face Video Generation with Natural Head Pose

Animating Portrait Line Drawings from a Single Face Photo and a Speech Signal

LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions

Photorealistic Audio-driven Video Portraits

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis