A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation

Louis Airale, Dominique Vaufreydaz, Xavier Alameda-Pineda
2023-07-04
Abstract: Animating still face images with deep generative models using a speech input signal is an active research topic and has seen important recent progress. However, much of the effort has been put into lip syncing and rendering quality while the generation of natural head motion, let alone the audio-visual correlation between head motion and speech, has often been neglected. In this work, we propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and the dynamics of the head and lips. In particular, we train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network to produce audio-aligned motion unfolding over diverse time scales. Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation. The experiments show significant improvements over the state of the art in head motion dynamics quality and in multi-scale audio-visual synchrony both in the landmark domain and in the image domain.
Graphics, Computer Vision and Pattern Recognition, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses several key issues in animating static facial images from a speech input signal (Talking Head Generation), focusing on natural head movements and audio-visual synchronization.

#### Main Issues:

1. **Natural Head Movement Generation**: Many current studies focus primarily on lip synchronization and rendering quality, often neglecting the generation of natural head movements and their audio-visual association with speech.
2. **Multi-Scale Audio-Visual Synchronization**: While progress has been made in lip synchronization over short segments, the low-frequency correlation between head movements and speech over longer periods is insufficiently handled.
3. **Comprehensive Synchronization Loss Function**: Existing loss functions are mostly optimized for lip synchronization and fail to adequately account for the synchronization between head movements and speech at different time scales.

### Solutions

To address these issues, the authors propose the following methods:

1. **Multi-Scale Audio-Visual Synchronization Loss**: Constructing a multi-scale synchronizer pyramid to evaluate audio-visual synchronization at different time scales.
2. **Multi-Scale Autoregressive Generative Adversarial Network (GAN)**: Using a multi-scale autoregressive GAN to generate head and lip movements synchronized with speech while reducing error accumulation.
3. **Experimental Validation**: Extensive experiments were conducted on multiple datasets, demonstrating the superior performance of the proposed method in terms of head movement quality and multi-scale audio-visual synchronization.
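
The idea of scoring synchrony over an input pyramid can be illustrated with a minimal sketch. This is not the authors' implementation: the paper trains a stack of syncer models, whereas the code below swaps in a plain cosine-similarity score as a stand-in, assumes the audio and motion features have already been embedded in a shared space of equal dimension, and builds the pyramid by averaging adjacent frames. The names `temporal_pyramid`, `cosine_sync_loss`, and `multiscale_sync_loss` are hypothetical.

```python
import numpy as np


def temporal_pyramid(x, num_scales):
    """Return [x, x at half rate, x at quarter rate, ...] along the time axis.

    x has shape (T, D). Each coarser level averages pairs of adjacent frames.
    """
    levels = [x]
    for _ in range(num_scales - 1):
        prev = levels[-1]
        t = (prev.shape[0] // 2) * 2  # drop a trailing odd frame, if any
        levels.append(prev[:t].reshape(t // 2, 2, -1).mean(axis=1))
    return levels


def cosine_sync_loss(audio_emb, motion_emb, eps=1e-8):
    """Mean (1 - cosine similarity) over frames.

    ASSUMPTION: a stand-in for a learned syncer model, not the paper's loss.
    """
    num = (audio_emb * motion_emb).sum(axis=1)
    den = np.linalg.norm(audio_emb, axis=1) * np.linalg.norm(motion_emb, axis=1) + eps
    return float(np.mean(1.0 - num / den))


def multiscale_sync_loss(audio, motion, num_scales=3):
    """Average the per-scale synchrony loss over both pyramids."""
    audio_pyr = temporal_pyramid(audio, num_scales)
    motion_pyr = temporal_pyramid(motion, num_scales)
    per_scale = [cosine_sync_loss(a, m) for a, m in zip(audio_pyr, motion_pyr)]
    return sum(per_scale) / len(per_scale), per_scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(16, 8))   # 16 frames, 8-dim embedding (assumed)
    motion = rng.normal(size=(16, 8))  # same shape by assumption
    total, per_scale = multiscale_sync_loss(audio, motion)
    print(total, per_scale)
```

Coarser pyramid levels respond to slow trends, such as head motion that follows the rhythm of a sentence, while the finest level is dominated by fast, lip-scale variation. Summing a score over all levels is one simple way to penalize misalignment at both ends of that range.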