Talking Face Generation Via Learning Semantic and Temporal Synchronous Landmarks

Aihua Zheng,Feixia Zhu,Hao Zhu,Mandi Luo,Ran He
DOI: https://doi.org/10.1109/icpr48806.2021.9412425
2021-01-01
Abstract:Given a speech clip and facial image, the goal of talking face generation is to synthesize a talking face video with accurate mouth synchronization and natural face motion. Recent progress has proven the effectiveness of the landmarks as the intermediate information during talking face generation. However, the large gap between audio and visual modalities makes the prediction of landmarks challenging and limits generation ability. This paper proposes a semantic and temporal synchronous landmark learning method for talking face generation. First, we propose to introduce a word detector to enforce richer semantic information. Then, we propose to preserve the temporal synchronization and consistency between landmarks and audio via the proposed temporal residual loss. Lastly, we employ a U-Net generation network with adaptive reconstruction loss to generate facial images for the predicted landmarks. Experimental results on two benchmark datasets LRW and GRID demonstrate the effectiveness of our model compared to the state-of-the-art methods of talking face generation.
What problem does this paper attempt to address?