MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, Seungryong Kim
2024-03-28
Abstract: Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aimed to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses high-quality talking head generation: producing high-fidelity talking head videos whose lip movements are synchronized to a given audio input. It proposes a framework called MoDiTalker, designed to overcome several key challenges in existing methods:

1. **Limitations of traditional GAN models**: Existing methods based on Generative Adversarial Networks (GANs) often encounter issues such as instability and mode collapse during training.
2. **Shortcomings of diffusion models**: Although recent diffusion-based methods perform well on image generation tasks, they still face long sampling times and poor temporal consistency when applied to talking head generation.
3. **Capturing local lip movements**: Existing methods often struggle to capture subtle lip movements from globally embedded audio information.

To address these challenges, MoDiTalker adopts a two-stage diffusion model architecture:

- **Audio-to-Motion (AToM)**: This module generates synchronized lip movements from audio input, using attention mechanisms to distinguish lip-related regions from lip-unrelated regions and thereby improve lip synchronization.
- **Motion-to-Video (MToV)**: This module generates high-fidelity talking head videos from the generated facial motion sequences. Its tri-plane representation improves temporal consistency while reducing inference time complexity.

According to the reported experiments, MoDiTalker significantly outperforms existing GAN-based and diffusion-based methods on multiple benchmark datasets, achieving the best results in image quality and lip synchronization accuracy, among other metrics. User studies additionally confirm MoDiTalker's superior performance in lip synchronization fidelity, identity preservation, and video quality.
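The two-stage data flow (audio to motion, then motion to video) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Motion`, `audio_to_motion`, `motion_to_video`, `generate`, `dense_elements`, `triplane_elements`) is hypothetical, the stage bodies are trivial placeholders for diffusion samplers, and representing motion as per-frame landmark coordinates is an assumption. The last two helpers assume that a tri-plane representation stores three 2D planes spanning (H, W), (T, H), and (T, W) in place of a dense T x H x W volume, which is one common reading of the term.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names in this sketch are hypothetical, not the authors' API.
Landmark = Tuple[float, float]
Frame = List[List[float]]  # frame[y][x]; grayscale stand-in for an RGB image


@dataclass
class Motion:
    # Assumption: motion is one set of landmark coordinates per output frame.
    landmarks: List[List[Landmark]]


def audio_to_motion(audio: List[float], num_landmarks: int = 4) -> Motion:
    """Stage 1 placeholder (AToM): audio features -> per-frame motion.

    A real model would run an audio-conditioned diffusion sampler; here each
    audio value simply becomes the y-coordinate of every landmark.
    """
    return Motion([[(float(i), a) for i in range(num_landmarks)] for a in audio])


def motion_to_video(identity: Frame, motion: Motion) -> List[Frame]:
    """Stage 2 placeholder (MToV): identity frame + motion -> video frames.

    A real model would synthesize new pixels; here each output frame is a
    copy of the identity frame, one per motion step.
    """
    return [[row[:] for row in identity] for _ in motion.landmarks]


def generate(identity: Frame, audio: List[float]) -> List[Frame]:
    """Chain the two stages: audio -> motion -> video."""
    return motion_to_video(identity, audio_to_motion(audio))


def dense_elements(t: int, h: int, w: int) -> int:
    """Element count of a dense T x H x W video volume."""
    return t * h * w


def triplane_elements(t: int, h: int, w: int) -> int:
    """Element count of three 2D planes: (H, W), (T, H), and (T, W)."""
    return h * w + t * h + t * w


identity = [[0.0] * 8 for _ in range(8)]
audio = [0.1, 0.5, 0.3]
video = generate(identity, audio)
print(len(video))  # one output frame per audio feature
print(dense_elements(16, 64, 64), triplane_elements(16, 64, 64))
```

Under these assumptions, a 16-frame 64 x 64 volume holds 65,536 elements while the three planes hold 6,144. That gap is the intuition behind the reduced inference cost, although the actual savings depend on the model's latent sizes.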