Abstract: Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both approaches can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method allows us to synthesize emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based methods on emotion transfer speech synthesis and the compared text-based methods on text-based emotion prediction speech synthesis. The experiments also show that the proposed method can control emotional expression flexibly. Detailed analysis shows the effectiveness of each module and the soundness of the overall design.

Machines Imitating Humans

Emotional Speaker Identification By Humans And Machines

Speech Synthesis with Mixed Emotions

Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation

Toward Synthesizing Expressive Mandarin Speech

Voice Cloning Using Artificial Intelligence and Machine Learning: A Review

Construction of virtual assistant based on basic emotions theory

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Paralinguistic and spectral feature extraction for speech emotion classification using machine learning techniques

Vocal emotion of humanoid robots: a study from brain mechanism

Conveying Emotions to Robots through Touch and Sound

A Mood Semantic Awareness Model for Emotional Interactive Robots

Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Prevalence and future prediction of type 2 diabetes mellitus in the Kingdom of Saudi Arabia: A systematic review of published studies

Enhancing Human-Machine Interaction: Real-Time Emotion Recognition through Speech Analysis

Emotional Expression Detection in Spoken Language Employing Machine Learning Algorithms

Machine learning techniques for speech emotion recognition using paralinguistic acoustic features

AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Prosody Analysis And Modeling For Emotional Speech Synthesis