Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

Foivos Paraperas Papantoniou, Panagiotis P. Filntisis, Petros Maragos, Anastasios Roussos

DOI: https://doi.org/10.48550/arXiv.2112.00585

2022-03-30

Abstract: In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics. Finally, the altered facial expressions are used to photo-realistically manipulate the facial region in the input scene based on an especially-designed neural face renderer. To the best of our knowledge, our method is the first to be capable of controlling the actor's facial expressions by even using as a sole input the semantic labels of the manipulated emotions, while at the same time preserving the speech-related lip movements. We conduct extensive qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and the especially promising results that we obtain. Our method opens a plethora of new possibilities for useful applications of neural rendering technologies, ranging from movie post-production and video games to photo-realistic affective avatars.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the photo-realistic manipulation of an actor's emotional state in "in-the-wild" videos while preserving speech-related lip movements. Existing techniques are limited when it comes to changing the emotion shown on a face in video, for example making an actor with a neutral expression look happy, without relying on pre-existing footage. They also generally fail to keep the lips synchronized with the original speech once the expression has been changed. The paper therefore proposes a deep learning method, Neural Emotion Director (NED), which controls the actor's facial expressions using only semantic emotion labels as input while leaving the speech-related lip movements intact. NED can translate a facial performance into any of the six basic emotions (anger, happiness, surprise, fear, disgust, sadness) or neutral based on a semantic label, and it can also apply a specific expressive style to a target actor without person-specific training. This opens up new possibilities for applications such as film post-production, video games, and photo-realistic affective avatars.
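The data flow described above (a disentangled face representation in which only the expression is changed by an emotion label) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `FaceParams`, `translate_expression`, `direct_emotion`, and the per-emotion `offsets` table are hypothetical names, and the additive offset is a trivial stand-in for the learned translation network, not the paper's model. The sketch does not model lip-movement preservation or rendering.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

# The seven labels named in the summary: six basic emotions plus neutral.
EMOTIONS = ("neutral", "anger", "happiness", "surprise", "fear", "disgust", "sadness")

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parameters of a disentangled 3D face model."""
    identity: Vector    # who the actor is (left untouched)
    pose: Vector        # head pose (left untouched)
    expression: Vector  # facial expression (the only part that is edited)


def translate_expression(expression: Vector, emotion: str,
                         offsets: Dict[str, Vector]) -> Vector:
    """Placeholder translator: shift the expression by a per-emotion offset.

    A real system would use a learned, sequence-aware network here.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion!r}")
    shift = offsets[emotion]
    if len(shift) != len(expression):
        raise ValueError("offset and expression must have the same length")
    return tuple(e + s for e, s in zip(expression, shift))


def direct_emotion(frames: List[FaceParams], emotion: str,
                   offsets: Dict[str, Vector]) -> List[FaceParams]:
    """Return new frames whose expression is translated; identity and pose are copied."""
    return [
        replace(f, expression=translate_expression(f.expression, emotion, offsets))
        for f in frames
    ]


# Toy usage with made-up numbers.
frames = [
    FaceParams(identity=(1.0,), pose=(0.0, 0.0, 0.0), expression=(0.0, 0.0)),
    FaceParams(identity=(1.0,), pose=(0.0, 0.5, 0.0), expression=(0.25, 0.5)),
]
offsets = {"happiness": (0.5, 0.25)}
directed = direct_emotion(frames, "happiness", offsets)
```

The point of the sketch is the separation of concerns: because identity and pose live in separate fields, an edit driven by a semantic label can be confined to the expression parameters before any rendering step.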
