Abstract: Audio-visual speech synthesis is the core function for realizing face-to-face human-computer communication. While considerable efforts have been made to enable people to talk with computers as they do with other people, how to integrate emotional expressions into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the notion of the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) 3-D emotional space, in which emotions can be described and quantified along three dimensions. Based on this new definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (GMM), as well as a facial expression synthesis model. We further present an emotional audio-visual speech synthesis approach. Specifically, we take the text and the target PAD values as input, and first employ a text-to-speech (TTS) engine to generate neutral speech. The Boosting-GMM is then used to convert the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual speech. We designed three objective and five subjective experiments to evaluate the performance of each model and of the overall approach. Our experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals that a mutually reinforcing relationship indeed exists between audio and video information.

A Model of Emotional Speech Generation Based on Conditional Generative Adversarial Networks

An Emotional Dialogue System Using Conditional Generative Adversarial Networks with a Sequence-to-Sequence Transformer Encoder

Generation of Artificial F0-contours of Emotional Speech with Generative Adversarial Networks

Neural Conversation Generation with Auxiliary Emotional Supervised Models

Temporal conditional Wasserstein GANs for audio-visual affect-related ties

Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks

Emotional Speech Generator by using Generative Adversarial Networks

ET-GAN: Cross-Language Emotion Transfer Based on Cycle-Consistent Generative Adversarial Networks

Emotional Neural Language Generation Grounded in Situational Contexts

EmoEden: Applying Generative Artificial Intelligence to Emotional Learning for Children with High-Function Autism

Improving Speech Emotion Recognition With Adversarial Data Augmentation Network

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Emotional Audio-Visual Speech Synthesis Based on PAD

Data Augmentation Using Conditional GANs for Facial Emotion Recognition

Can Generative Agents Predict Emotion?

Prosody Analysis And Modeling For Emotional Speech Synthesis

The Good, The Bad, and Why: Unveiling Emotions in Generative AI

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation