DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Hanbo Cheng, Limin Lin, Chenyu Liu, Pengcheng Xia, Pengfei Hu, Jiefeng Ma, Jun Du, Jia Pan
2024-10-18
Abstract: Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the generation of realistic, vivid "talking head" videos, that is, high-quality dynamic head videos produced from a single portrait and a speech audio clip. Although diffusion models have made significant progress in talking head generation, most existing methods rely on autoregressive strategies, which have the following problems:

1. **Limited use of context**: Autoregressive methods struggle to make full use of context outside the current generation step.
2. **Error accumulation**: Errors can build up over the course of generation as the video gets longer.
3. **Slow generation speed**: Autoregressive methods generate videos frame by frame, which makes generation slow.

To solve these problems, the authors propose DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a non-autoregressive diffusion framework that aims to generate video sequences of arbitrary length in a single pass. The main contributions of DAWN include:

1. **Proposing for the first time a non-autoregressive solution based on the diffusion model** for general talking head video generation, achieving faster inference and high-quality results.
2. **Decoupling lip, head and blink movements** to compensate for the extrapolation limitations of non-autoregressive strategies in long-video generation and to strengthen temporal modeling.
3. **Introducing a lightweight Pose and Blink generation network (PBNet)** that generates natural head pose and blink sequences from audio in a non-autoregressive manner.
4. **Proposing a two-stage curriculum learning (TCL) strategy** that trains the model in phases so that it produces accurate lip movements and precise pose/blink control, with strong convergence and extrapolation capabilities.

These improvements allow DAWN to generate high-quality long videos, with particular gains in lip synchronization and in the naturalness of head poses and blinks.
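The contrast between autoregressive and all-at-once generation described above can be made concrete with a toy sketch. This is not DAWN's code: the function names, the 16-frame pass size, and the additive error model are all assumptions chosen for illustration. It shows only why a chain of dependent passes grows with video length and can carry error forward, while a single pass does neither.

```python
import math


def sequential_passes(total_frames: int, frames_per_pass: int) -> int:
    """Dependent generation passes an autoregressive loop needs (hypothetical helper)."""
    return math.ceil(total_frames / frames_per_pass)


def autoregressive_drift(num_passes: int, error_per_pass: float) -> list[float]:
    """Toy additive model (an assumption): each pass is conditioned on the
    previous output, so it inherits that output's error and adds its own."""
    drift = 0.0
    history = []
    for _ in range(num_passes):
        drift += error_per_pass
        history.append(drift)
    return history


def all_at_once_drift(num_segments: int, error_per_segment: float) -> list[float]:
    """Toy model (an assumption): every segment comes from the same pass,
    so no segment inherits error from an earlier one."""
    return [error_per_segment] * num_segments


if __name__ == "__main__":
    # 200 frames at 16 frames per pass: 13 dependent passes versus 1.
    print(sequential_passes(total_frames=200, frames_per_pass=16))  # 13
    print(autoregressive_drift(4, 0.5))  # [0.5, 1.0, 1.5, 2.0]
    print(all_at_once_drift(4, 0.5))     # [0.5, 0.5, 0.5, 0.5]
```

Real models do not accumulate error in this simple additive way; the sketch only illustrates the dependency structure behind the three problems listed in the summary.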