Abstract: Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.

What problem does this paper attempt to address?

This paper addresses the generation of realistic and vivid "talking head" videos: given a single portrait and a speech audio clip, produce a high-quality video of the head speaking. Although diffusion models have made significant progress in talking head generation, most existing methods rely on autoregressive strategies, which have the following problems:

1. **Limited use of context**: Autoregressive methods struggle to use context beyond the current generation step.
2. **Error accumulation**: Errors made in earlier steps carry into later ones, so quality can degrade as the video gets longer.
3. **Slow generation speed**: Autoregressive methods generate the video sequentially, one frame or clip at a time, which makes generation slow.

To solve these problems, the authors propose DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a non-autoregressive diffusion framework that generates video sequences of arbitrary length all at once. The main contributions of DAWN are:

1. **Proposing, for the first time, a non-autoregressive diffusion-based solution** for general talking head video generation, achieving faster inference and high-quality results.
2. **Decoupling lip, head pose, and blink movements** to compensate for the extrapolation limitations of non-autoregressive strategies in long-video generation and to strengthen temporal modeling.
3. **Introducing a lightweight Pose and Blink generation network (PBNet)** that generates natural head pose and blink sequences from audio in a non-autoregressive manner.
4. **Proposing a two-stage curriculum learning (TCL) strategy** that trains the model in phases to produce accurate lip movements and precise pose/blink control, giving the model strong convergence and extrapolation capabilities.

Together, these contributions let DAWN generate high-quality long videos, with particular strengths in lip synchronization and in the naturalness of head poses and blinks.
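The contrast between the two generation strategies can be made concrete with a toy scheduling sketch. This is not DAWN's code: the function names, the clip length of 40 frames, and the 250-frame example are hypothetical and chosen only for illustration. The sketch counts how many generation passes each strategy needs and how many earlier passes the final frames depend on, which is where sequential latency and error accumulation come from.

```python
# Toy illustration only: all names and numbers here are assumptions,
# not taken from the DAWN paper or its repository.

def autoregressive_plan(num_frames: int, clip_len: int) -> list[tuple[int, int]]:
    """Split a video into clips that must be generated one after another.

    Each clip is conditioned on the clip before it, so the passes cannot
    run in parallel and an early error is carried into later clips.
    """
    plan = []
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        plan.append((start, end))
        start = end
    return plan


def non_autoregressive_plan(num_frames: int) -> list[tuple[int, int]]:
    """Cover every frame with a single pass over the whole sequence."""
    return [(0, num_frames)]


def max_dependency_depth(plan: list[tuple[int, int]]) -> int:
    """Number of earlier generation passes the last clip depends on."""
    return len(plan) - 1


if __name__ == "__main__":
    frames = 250  # hypothetical: 10 seconds at 25 frames per second
    ar = autoregressive_plan(frames, clip_len=40)
    nar = non_autoregressive_plan(frames)
    print("autoregressive passes:", len(ar), "depth:", max_dependency_depth(ar))
    print("non-autoregressive passes:", len(nar), "depth:", max_dependency_depth(nar))
```

In this toy setting the autoregressive plan needs 7 sequential passes and its last clip sits behind 6 earlier ones, while the all-at-once plan needs a single pass with no such chain. A real system differs in many ways (for example, the cost of a single pass grows with sequence length), so the sketch only illustrates the dependency structure, not actual speed.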

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation
