A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

Louis Airale, Dominique Vaufreydaz, Xavier Alameda-Pineda

2023-07-04

Abstract: Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.

Graphics, Computer Vision and Pattern Recognition, Machine Learning, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to address several key issues in animating static facial images using speech input signals, particularly in the task of Talking Head Generation, focusing on natural head movements and audio-visual synchronization.

#### Main Issues:

1. **Natural Head Movement Generation**: Many current studies primarily focus on lip synchronization and rendering quality, often neglecting the generation of natural head movements and their audio-visual association with speech.
2. **Multi-Scale Audio-Visual Synchronization**: While progress has been made in lip synchronization over short segments, there is insufficient handling of the low-frequency correlation between head movements and speech over longer periods.
3. **Comprehensive Synchronization Loss Function**: Existing loss functions are mostly optimized for lip synchronization and fail to adequately consider the synchronization relationship between head movements and speech at different time scales.

### Solutions

To address these issues, the authors propose the following methods:

1. **Multi-Scale Audio-Visual Synchronization Loss**: Constructing a multi-scale synchronizer pyramid to evaluate the audio-visual synchronization effects at different time scales.
2. **Multi-Scale Autoregressive Generative Adversarial Network (GAN)**: Utilizing a multi-scale structured autoregressive GAN to generate head and lip movements synchronized with speech while reducing error accumulation.
3. **Experimental Validation**: Extensive experiments were conducted on multiple datasets, demonstrating the superior performance of the proposed method in terms of head movement quality and multi-scale audio-visual synchronization.
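The idea of scoring audio-visual synchrony on a pyramid of time scales can be illustrated with a small sketch. This is not the authors' implementation: the average-pooling pyramid, the function names (`downsample`, `build_pyramid`, `multiscale_sync_loss`), and especially `toy_sync_distance`, which stands in for a trained syncer network, are all assumptions made for illustration. It also assumes the audio and motion features are already aligned frame by frame.

```python
from typing import Callable, List

Seq = List[List[float]]  # time-major: one feature vector per frame


def downsample(seq: Seq, factor: int = 2) -> Seq:
    """Average-pool over time with non-overlapping windows of `factor` frames."""
    pooled = []
    for start in range(0, len(seq) - factor + 1, factor):
        window = seq[start:start + factor]
        dims = len(window[0])
        pooled.append([sum(f[d] for f in window) / factor for d in range(dims)])
    return pooled


def build_pyramid(seq: Seq, levels: int = 3, factor: int = 2) -> List[Seq]:
    """Return the sequence at full resolution plus progressively coarser copies."""
    pyramid = [seq]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1], factor))
    return pyramid


def toy_sync_distance(audio: Seq, motion: Seq) -> float:
    """Hypothetical stand-in for a trained syncer: mean absolute gap between
    the per-frame feature means of the two modalities."""
    gaps = [abs(sum(a) / len(a) - sum(m) / len(m)) for a, m in zip(audio, motion)]
    return sum(gaps) / len(gaps)


def multiscale_sync_loss(
    audio: Seq,
    motion: Seq,
    levels: int = 3,
    score: Callable[[Seq, Seq], float] = toy_sync_distance,
) -> float:
    """Average one synchrony score per pyramid level (assumed equal weighting)."""
    audio_pyr = build_pyramid(audio, levels)
    motion_pyr = build_pyramid(motion, levels)
    per_level = [score(a, m) for a, m in zip(audio_pyr, motion_pyr)]
    return sum(per_level) / len(per_level)
```

In this sketch the finest level responds to frame-level mismatches such as lip motion, while the coarser levels only see slow trends, which is where longer-term head motion would show up. A real system would replace `toy_sync_distance` with one learned syncer per level.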

Audio-driven Talking Face Video Generation with Natural Head Pose

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Realistic talking face animation with speech-induced head motion

Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

TalkingFlow: Talking Facial Landmark Generation with Multi-Scale Normalizing Flow Network

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Autoregressive GAN for Semantic Unconditional Head Motion Generation

Realistic Speech-Driven Facial Animation with GANs

Spatially and Temporally Optimized Audio-Driven Talking Face Generation

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

Talking-head Generation with Rhythmic Head Motion

High-Fidelity and Freely Controllable Talking Head Video Generation

Predicting Personalized Head Movement From Short Video and Speech Signal