Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic
2023-07-30
Abstract: Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head. Our solution is capable of hallucinating head movements, facial expressions, such as blinks, and preserving a given background. We evaluate our model on two different datasets, achieving state-of-the-art results on both of them.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the generation of realistic talking-face videos with diffusion models, without the need for additional guiding videos or reference frames. Specifically, it targets the following problems:

1. **Natural Head Movements and Facial Expressions**: Existing methods struggle to generate natural head movements and facial expressions, especially without additional guidance.
2. **Video Generation from a Single Identity Image and Audio Sequence**: The proposed method generates a realistic talking-face video from only one identity frame and an audio sequence.
3. **Avoiding Mode Collapse**: Traditional methods such as Generative Adversarial Networks (GANs) are prone to mode collapse, which results in monotonous generated samples.
4. **Maintaining Lip Sync**: The generated lip movements must stay synchronized with the input audio throughout the video.

With its diffusion-based approach, the paper addresses these issues and reports state-of-the-art results on two datasets. The method generates natural expressions and head movements while also preserving the background and maintaining lip synchronization.
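
The abstract describes an autoregressive diffusion model that turns one identity image and an audio sequence into a video. The Python sketch below illustrates only that data flow, under several assumptions that are not taken from the paper: frames are flat lists of floats, each audio chunk is a single float, every frame is conditioned on the identity frame and the one previously generated frame, and sampling uses the standard DDPM ancestral update with a linear noise schedule. `placeholder_denoiser`, `sample_frame`, and `generate_video` are hypothetical names, and the denoiser is a toy stand-in for a trained network, so the output is not a meaningful image.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative alpha products (standard DDPM quantities)."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def placeholder_denoiser(x_t, t, identity, prev_frame, audio_feat):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a neural network; this toy version only shows
    which conditioning signals are passed in at every denoising step.
    """
    return [
        x - 0.5 * (i + p) - audio_feat
        for x, i, p in zip(x_t, identity, prev_frame)
    ]


def sample_frame(identity, prev_frame, audio_feat, betas, alpha_bars, rng):
    """Generate one frame by running the reverse diffusion chain from pure noise."""
    x = [rng.gauss(0.0, 1.0) for _ in identity]
    for t in reversed(range(len(betas))):
        eps = placeholder_denoiser(x, t, identity, prev_frame, audio_feat)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        # No noise is added on the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [
            scale * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
            for xi, ei in zip(x, eps)
        ]
    return x


def generate_video(identity, audio_feats, num_steps=10, seed=0):
    """Autoregressive loop: one frame per audio chunk, each conditioned on the last."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    frames, prev = [], identity  # assumption: the first "previous frame" is the identity frame
    for audio_feat in audio_feats:
        frame = sample_frame(identity, prev, audio_feat, betas, alpha_bars, rng)
        frames.append(frame)
        prev = frame
    return frames
```

Calling `generate_video([0.1, 0.2, 0.3, 0.4], [0.0, 0.5, 1.0])` returns three frames of four values each, one per audio chunk. Each frame depends on the one before it, so frames must be generated sequentially rather than in parallel.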