Generating Smooth and Facial-Details-Enhanced Talking Head Video: A Perspective of Pre and Post Processes

Tian Lv, Yu-Hui Wen, Zhiyao Sun, Zipeng Ye, Yong-Jin Liu
DOI: https://doi.org/10.1145/3503161.3551583
2022-01-01
Abstract: Talking head video generation has received increasing attention recently. The quality of videos produced by state-of-the-art deep learning methods, especially their facial details, is still limited by either the quality of the training data or the performance of the generators, and needs further improvement. In this paper, we propose a data pre- and post-processing strategy based on a key observation: generating talking head video from multi-modal input is a challenging problem, and generating smooth video with fine facial details makes the problem even harder. We therefore decompose the solution into a main deep model, a pre-process, and a post-process. With the aid of the pre-process, the main deep model generates a reasonably good talking face video; the pre-process also supports the post-process, which restores smooth and fine facial details in the final video. In particular, our main deep model reconstructs a 3D face from an input reference frame and then uses an AudioNet to generate a sequence of facial expression coefficients from an input audio clip. To preserve facial details in the generated video, the pre-process samples the original texture from the reference frame with the aid of the reconstructed 3D face and a predefined UV map. Accordingly, in the post-process, we smooth the expression coefficients of adjacent frames to reduce jitter and apply a pretrained face restoration module to recover the fine facial details. Experimental results and an ablation study demonstrate the advantages of the proposed method.
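The post-process step of smoothing expression coefficients across adjacent frames can be illustrated with a minimal sketch. The abstract does not say which filter the authors use, so the centered moving average below, the function name `smooth_expression_coefficients`, and the `window` parameter are all assumptions made for illustration, not the paper's implementation.

```python
from typing import List, Sequence


def smooth_expression_coefficients(
    coeffs: Sequence[Sequence[float]], window: int = 5
) -> List[List[float]]:
    """Smooth per-frame expression coefficients with a centered moving average.

    coeffs: one row per video frame, one column per expression coefficient.
    window: odd number of adjacent frames to average (hypothetical parameter;
            the paper's actual filter and window size are not given in the abstract).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    num_frames = len(coeffs)
    half = window // 2
    smoothed: List[List[float]] = []

    for t in range(num_frames):
        # Collect the neighbouring frames; indices past either end are clamped,
        # which replicates the first or last frame instead of shrinking the window.
        neighbours = [
            coeffs[min(max(t + k, 0), num_frames - 1)]
            for k in range(-half, half + 1)
        ]
        # Average each coefficient independently across the neighbouring frames.
        smoothed.append([sum(column) / window for column in zip(*neighbours)])

    return smoothed
```

A larger `window` suppresses more frame-to-frame jitter but also flattens fast, legitimate mouth motion, so in practice the window would need to be tuned against lip-sync quality.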