Abstract: Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and synthesizing expressive speech has therefore attracted much attention in recent years. Previous methods perform expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio; both approaches can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model extended with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method allows emotional speech to be synthesized in different ways, i.e., by transferring the emotion from reference audio, predicting the emotion from input text, or controlling the emotion strength manually. Extensive experiments on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer and text-based emotion prediction, respectively. The experiments also show that the proposed method can control emotional expression flexibly. Detailed analysis confirms the effectiveness of each module and of the overall design.

EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

Synthesizing English Emphatic Speech for Multimodal Corrective Feedback in Computer-Aided Pronunciation Training

Controllable Emphatic Speech Synthesis Based on Forward Attention for Expressive Speech Synthesis

Emotional speaker recognition based on similar neighbor phenomenon

Prosody Analysis And Modeling For Emotional Speech Synthesis

HMM-based Emphatic Speech Synthesis for Corrective Feedback in Computer-Aided Pronunciation Training

Generating emphatic speech with hidden Markov model for expressive speech synthesis

HMM-based Speech Synthesis with a Flexible Mandarin Stress Adaptation Model

Emphatic Speech Generation with Conditioned Input Layer and Bidirectional LSTMs for Expressive Speech Synthesis

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

An Emotional Text-Driven 3D Visual Pronunciation System for Mandarin Chinese

Exemplar-Based Emotive Speech Synthesis

A Realistic 3D Articulatory Animation System for Emotional Visual Pronunciation

Emotional Audio-Visual Speech Synthesis Based on PAD

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data

Controllable Multi-Speaker Emotional Speech Synthesis with Emotion Representation of High Generalization Capability

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

EE-TTS: Emphatic Expressive TTS with Linguistic Information

Word-Level Emphasis Modelling in HMM-Based Speech Synthesis