Abstract: The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During emotion transfer, the identity information of the source speaker can also affect the synthesized result, causing speaker leakage, i.e., the synthetic speech may carry the voice identity of the source speaker rather than that of the target speaker. This paper proposes a new method that synthesizes controllable, emotionally expressive speech while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework that uses an emotion embedding as the conditioning variable to provide emotion information. It contains two emotion disentangling modules that 1) obtain a speaker-irrelevant and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of the synthetic speech to match the expected ones. Moreover, we present an intuitive method to control the emotion strength in the synthetic speech for the target speaker. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which controls the emotion strength conveyed by the embedding. Extensive experiments have been conducted on a disjoint Mandarin corpus, and the results demonstrate that the proposed method is able to synthesize reasonable emotional speech for the target speaker. Compared to state-of-the-art methods that learn a reference embedding, our method achieves the best performance on the cross-speaker emotion transfer task, indicating that it sets a new state of the art in learning a speaker-irrelevant emotion embedding. Furthermore, the strength ranking test and pitch trajectory plots demonstrate that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.
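The strength control described in the abstract can be illustrated with a minimal sketch. The abstract states only that the learned emotion embedding is adjusted with a scalar value; the plain multiplication used here, the concatenation of the embedding onto every encoder frame, and the names `scale_emotion_embedding` and `condition_encoder_outputs` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def scale_emotion_embedding(embedding: List[float], strength: float) -> List[float]:
    """Scale an emotion embedding by a scalar strength.

    Assumption: the adjustment is a plain element-wise multiplication.
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return [strength * value for value in embedding]


def condition_encoder_outputs(
    encoder_outputs: List[List[float]], emotion_embedding: List[float]
) -> List[List[float]]:
    """Append the same emotion embedding to every encoder frame.

    Assumption: conditioning is done by concatenation along the feature axis.
    """
    return [frame + emotion_embedding for frame in encoder_outputs]


# Toy values, not data from the paper.
emotion_embedding = [0.5, -1.0, 2.0]
encoder_outputs = [[0.1, 0.2], [0.3, 0.4]]

weak = scale_emotion_embedding(emotion_embedding, 0.5)    # weaker emotion
strong = scale_emotion_embedding(emotion_embedding, 2.0)  # stronger emotion
conditioned = condition_encoder_outputs(encoder_outputs, strong)
```

Under these assumptions, a scalar below 1 shrinks the embedding toward zero and a scalar above 1 amplifies it, so a single learned embedding per emotion can be reused across a range of strengths at inference time.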
