Abstract: Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aim to address these limitations and improve fidelity. However, they still face challenges, including long sampling times and difficulty maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce a high-quality head video following the generated motion. AToM excels at capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.

What problem does this paper attempt to address?

The paper addresses high-quality talking head generation: producing high-fidelity talking head videos whose lip movements are synchronized to a given audio input. It proposes a new framework, MoDiTalker, designed to overcome several key challenges in existing methods:

1. **Limitations of traditional GAN models**: Existing methods based on Generative Adversarial Networks (GANs) often encounter issues such as unstable training and mode collapse.
2. **Shortcomings of diffusion models**: Although recent diffusion-based methods perform very well on image generation tasks, they still suffer from long sampling times and poor temporal consistency when applied to talking head generation.
3. **Capturing local lip movements**: Existing methods often struggle to capture subtle lip movements from globally embedded audio information.

To address these challenges, MoDiTalker adopts a two-stage diffusion model architecture:

- **Audio-to-Motion (AToM)**: This module generates synchronized lip movements from audio input, using an attention mechanism to distinguish between lip-related and lip-unrelated regions, thereby improving lip synchronization.
- **Motion-to-Video (MToV)**: This module generates high-fidelity talking head videos from the generated facial motion sequences. Its tri-plane representation improves temporal consistency while reducing inference cost.

Experimental results show that MoDiTalker outperforms existing GAN-based and diffusion-based methods on multiple benchmark datasets, achieving the best reported performance in image quality and lip synchronization accuracy. User studies additionally confirm its advantage in lip synchronization fidelity, identity preservation, and video quality.
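The summary does not spell out how the tri-plane representation is constructed, so the following is only a rough sketch of why such a representation can be cheaper than a full video volume. It assumes a latent volume of shape `(T, H, W, C)` and uses mean-pooling along each axis as a stand-in for whatever learned projection the actual model uses; the function name `triplane_project` and all sizes are hypothetical.

```python
import numpy as np


def triplane_project(volume):
    """Collapse a (T, H, W, C) volume into three 2D planes.

    Mean-pooling is an illustrative assumption, not the paper's method;
    a real model would typically learn these projections.
    """
    plane_hw = volume.mean(axis=0)  # (H, W, C): content pooled over time
    plane_th = volume.mean(axis=2)  # (T, H, C): vertical change over time
    plane_tw = volume.mean(axis=1)  # (T, W, C): horizontal change over time
    return plane_hw, plane_th, plane_tw


# Hypothetical latent sizes, chosen only for the arithmetic below.
T, H, W, C = 16, 32, 32, 4
volume = np.random.default_rng(0).normal(size=(T, H, W, C))
planes = triplane_project(volume)

# Number of spatial-temporal positions a denoiser would have to process.
full_size = volume.size // C                   # T * H * W
plane_size = sum(p.size for p in planes) // C  # H*W + T*H + T*W
print(full_size, plane_size)
```

With these sizes, the full volume has T·H·W = 16384 positions, while the three planes together have H·W + T·H + T·W = 2048, which is the kind of reduction that could plausibly account for the lower inference cost the summary mentions.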

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Audio-driven Talking Face Video Generation with Natural Head Pose

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

DisCoHead: Audio-and-Video-Driven Talking Head Generation by Disentangled Control of Head Pose and Facial Expressions

FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation