Abstract: Audio-visual speech synthesis is the core function for realizing face-to-face human-computer communication. While considerable effort has been made to let people talk with computers as they do with other people, how to integrate emotional expression into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the notion of the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) 3-D emotional space, in which emotions can be described and quantified along three different dimensions. Based on this new definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (GMM), as well as a facial expression synthesis model. We further present an emotional audio-visual speech synthesis approach. Specifically, we take the text and the target PAD values as input, and first employ a text-to-speech (TTS) engine to generate neutral speech. The Boosting-GMM is then used to convert the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual speech. We designed three objective and five subjective experiments to evaluate the performance of each model and of the overall approach. Our experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals that a mutually reinforcing relationship indeed exists between audio and video information.
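The data flow described in the abstract (text and target PAD values in, neutral speech from TTS, conversion to emotional speech, facial expression synthesis, then modulation of the expression by acoustic features) can be sketched roughly as follows. This is only an illustration of the stage ordering, not the paper's models: every function name, the `PAD` class, the assumed [-1, 1] value range, the flat 200 Hz contour, and the arousal-based scaling rule are hypothetical placeholders standing in for the TTS engine, the Boosting-GMM, and the facial expression model.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class PAD:
    """Target emotion as a point in PAD space (the [-1, 1] range is an assumption)."""
    pleasure: float
    arousal: float
    dominance: float

    def as_array(self):
        return np.array([self.pleasure, self.arousal, self.dominance])


def tts_neutral(text, frames_per_char=5):
    """Placeholder TTS: a flat neutral F0 contour in Hz, one value per frame."""
    return np.full(len(text) * frames_per_char, 200.0)


def convert_to_emotional(neutral_f0, pad):
    """Placeholder for the Boosting-GMM conversion: scales F0 with arousal."""
    return neutral_f0 * (1.0 + 0.2 * pad.arousal)


def synthesize_expression(pad, n_frames):
    """Placeholder facial model: the same expression vector repeated per frame."""
    return np.tile(pad.as_array(), (n_frames, 1))


def modulate_expression(expression, emotional_f0):
    """Placeholder modulation: scale each frame's expression by F0 relative to its mean."""
    gain = emotional_f0 / emotional_f0.mean()
    return expression * gain[:, None]


def synthesize(text, pad):
    """Run the stages in the order the abstract lists them."""
    neutral = tts_neutral(text)                       # 1. neutral speech from TTS
    emotional = convert_to_emotional(neutral, pad)    # 2. neutral -> emotional speech
    face = synthesize_expression(pad, len(emotional)) # 3. facial expression from PAD
    return emotional, modulate_expression(face, emotional)  # 4. acoustic modulation


emotional_f0, face_track = synthesize("hello", PAD(0.5, 0.5, 0.2))
```

The only point of the sketch is the interface: both the speech conversion and the facial expression stage take the same PAD target, and the final stage consumes acoustic features of the converted speech rather than of the neutral speech.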

HMM-based Emphatic Speech Synthesis for Corrective Feedback in Computer-Aided Pronunciation Training

Synthesizing English Emphatic Speech for Multimodal Corrective Feedback in Computer-Aided Pronunciation Training

Generating emphatic speech with hidden Markov model for expressive speech synthesis

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data

Audiovisual Synthesis of Exaggerated Speech for Corrective Feedback in Computer-Assisted Pronunciation Training

Generating Emphasis from Neutral Speech Using Hierarchical Perturbation Model by Decision Tree and Support Vector Machine

Visual-speech Synthesis of Exaggerated Corrective Feedback

Controllable Emphatic Speech Synthesis Based on Forward Attention for Expressive Speech Synthesis

EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows

Synthesizing Expressive Speech to Convey Focus using a Perturbation Model for Computer-Aided Pronunciation Training

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMS for Expressive Speech Synthesis

Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis

HMM-based Speech Synthesis with a Flexible Mandarin Stress Adaptation Model

Detection and Emphatic Realization of Contrastive Word Pairs for Expressive Text-to-speech Synthesis

EE-TTS: Emphatic Expressive TTS with Linguistic Information

PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback

Emotional Audio-Visual Speech Synthesis Based on PAD

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Hierarchical Stress Modeling and Generation in Mandarin for Expressive Text-to-Speech

Simplified Deformation Compensation for Emotional Speaker Recognition