MakeItTalk: Speaker-Aware Talking-Head Animation

Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li
DOI: https://doi.org/10.1145/3414685.3417774
2021-02-26
Abstract: We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion and also animate artistic paintings, sketches, 2D cartoon characters, Japanese mangas, stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to prior state-of-the-art.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The paper "MakeItTalk: Speaker-Aware Talking-Head Animation" addresses the problem of generating high-quality talking-head animations. The authors propose a deep-learning method that produces expressive talking-head videos from a single facial image and an audio clip. Unlike previous methods that map audio directly to pixels, this method first disentangles the content information from the speaker information in the audio signal. The content information controls the movement of the lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the head dynamics. This separation improves the realism and expressiveness of the generated animations and lets the method produce plausible results for faces and voices not seen during training. The key contributions of the paper are:
- A new deep-learning architecture for predicting facial landmarks from speech signals. These landmarks capture not only facial expressions but also the overall head pose.
- Speaker-aware talking-head animation driven by the disentangled speech content and speaker information, inspired by progress in voice conversion.
- Two landmark-based image synthesis methods, one for non-photorealistic cartoon images and one for natural human face images. Both can handle new faces and cartoon characters not observed during training.
- A set of quantitative metrics and user studies for evaluating talking-head animation methods.
Through these contributions, the paper aims to overcome the limitations of existing techniques in generating realistic, expressive talking-head animations, particularly for faces and voices unseen during training.
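The two-stage landmark idea described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`add_offsets`, `animate_frame`), the toy two-point face, and the assumption that the content-driven and speaker-aware motions are combined as simple additive per-landmark offsets are all hypothetical choices made for illustration. In the actual method these offsets would be predicted by neural networks from audio embeddings; here they are passed in directly.

```python
from typing import List, Tuple

# A landmark is an (x, y, z) point; a face is a list of landmarks.
Point = Tuple[float, float, float]
Landmarks = List[Point]


def add_offsets(landmarks: Landmarks, offsets: Landmarks) -> Landmarks:
    """Displace each landmark by its matching offset."""
    if len(landmarks) != len(offsets):
        raise ValueError("landmarks and offsets must have the same length")
    return [
        (x + dx, y + dy, z + dz)
        for (x, y, z), (dx, dy, dz) in zip(landmarks, offsets)
    ]


def animate_frame(
    static_landmarks: Landmarks,
    content_offsets: Landmarks,
    speaker_offsets: Landmarks,
) -> Landmarks:
    """Compose one animated frame from a static face (hypothetical sketch).

    Stage 1 applies content-driven motion (lips and nearby regions).
    Stage 2 applies speaker-aware motion (expression style, head dynamics).
    Additive composition is an assumption made for this example.
    """
    content_pose = add_offsets(static_landmarks, content_offsets)
    return add_offsets(content_pose, speaker_offsets)


# Toy example: two landmarks standing in for a full facial landmark set.
static = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
content = [(0.0, -0.5, 0.0), (0.0, 0.0, 0.0)]    # only the first point moves (e.g. a lip)
speaker = [(0.25, 0.25, 0.0), (0.25, 0.25, 0.0)]  # every point shifts (e.g. head motion)

frame = animate_frame(static, content, speaker)
print(frame)
```

Keeping the two offset sets separate mirrors the summary's point that the same spoken content can drive the lips identically while different speakers produce different expression and head-motion styles.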