EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation Via Diffusion Model

Haodi Wang, Xiaojun Jia, Xiaochun Cao
DOI: https://doi.org/10.1109/fg59268.2024.10581957
2024-01-01
Abstract: Audio-driven talking face generation is a promising task that has attracted considerable attention. Although abundant effort has been devoted to video quality and lip synchronization, most existing works do not account for facial emotional expression, an aspect that cannot be ignored, during generation. In this paper, we propose an Emotion-controllable Audio-driven Talking Face generation framework called EAT-Face that enables us to control multiple types of emotions. Specifically, the proposed method consists of a Talking Face Reconstructor (TFR) and a Facial Emotion Controller (FEC), utilizing fused multimodal information including audio signals, visual images, and textual emotions for synthesis. First, TFR predicts face images synchronized with the given audio from random noise, using external guidance comprising audio features, character references, and face masks as conditions. Then, FEC further manipulates the facial emotions on top of TFR, using emotion embeddings extracted from emotion texts. However, there is a semantic misalignment between the emotion texts and the character images. To mitigate it, we additionally propose a strategy called joint Emotion-Visual Embedding (EVE). In this way, the proposed EAT-Face is able to control emotion more precisely. Extensive experiments involving both objective evaluations and subjective investigations demonstrate the effectiveness of our framework in synthesizing high-fidelity and emotional talking face videos.