Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

Foivos Paraperas Papantoniou, Panagiotis P. Filntisis, Petros Maragos, Anastasios Roussos
DOI: https://doi.org/10.48550/arXiv.2112.00585
2022-03-30
Abstract: In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics. Finally, the altered facial expressions are used to photo-realistically manipulate the facial region in the input scene based on an especially-designed neural face renderer. To the best of our knowledge, our method is the first to be capable of controlling the actor's facial expressions by even using as a sole input the semantic labels of the manipulated emotions, while at the same time preserving the speech-related lip movements. We conduct extensive qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and the especially promising results that we obtain. Our method opens a plethora of new possibilities for useful applications of neural rendering technologies, ranging from movie post-production and video games to photo-realistic affective avatars.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the photo-realistic manipulation of actors' emotional states in "in-the-wild" videos while preserving speech-related lip movements. Existing techniques are seriously limited when it comes to changing facial emotions in video, for example making an actor with a neutral expression look happy without relying on pre-existing footage of that expression. They are also usually unable to keep the face synchronized with the original speech once the expression has been altered. The paper therefore proposes a new deep learning method, Neural Emotion Director (NED), which controls an actor's facial expressions using only semantic emotion labels as input while leaving the lip movements produced by speaking intact. Based on such labels, the method can translate a facial performance into any of the six basic emotions (anger, happiness, surprise, fear, disgust, sadness) or neutral, and it can also apply a specific expressive style to a target actor without person-specific training. This opens up new possibilities for applications such as film post-production, video games, and photo-realistic affective avatars.
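The disentanglement idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `FaceParams`, `toy_translate`, and `edit_sequence` are hypothetical, and the label-dependent offset is only a placeholder for the learned translation network. It does not model lip-movement preservation or rendering. It only shows that, once identity, head pose, and expression are separate parameters, an emotion label can drive a change in the expression parameters alone.

```python
from dataclasses import dataclass, replace
from typing import List

# The seven labels named in the summary: six basic emotions plus neutral.
EMOTIONS = ("neutral", "anger", "happiness", "surprise", "fear", "disgust", "sadness")


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parametric 3D face description."""
    identity: List[float]    # who the actor is (left untouched)
    pose: List[float]        # head rotation/translation (left untouched)
    expression: List[float]  # expression coefficients (the only part edited)


def toy_translate(expression: List[float], emotion: str) -> List[float]:
    """Placeholder for a learned expression translator.

    Assumption: a constant label-dependent offset stands in for the network;
    the real method learns this mapping and accounts for temporal dynamics.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion!r}")
    offset = 0.1 * EMOTIONS.index(emotion)
    return [value + offset for value in expression]


def edit_sequence(frames: List[FaceParams], emotion: str) -> List[FaceParams]:
    """Change only the expression parameters of every frame in a sequence."""
    return [
        replace(frame, expression=toy_translate(frame.expression, emotion))
        for frame in frames
    ]


if __name__ == "__main__":
    clip = [FaceParams(identity=[1.0], pose=[0.0, 0.0, 0.0], expression=[0.2, 0.4])]
    edited = edit_sequence(clip, "happiness")
    # Identity and pose are carried over unchanged; only expression differs.
    print(edited[0].identity == clip[0].identity, edited[0].pose == clip[0].pose)
```

In a full pipeline the edited parameters would then be passed to a neural renderer to synthesize the facial region in each frame; that stage is omitted here.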