Abstract:The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis particularly aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the emotion transfer process, the identity information of the source speaker could also affect the synthesized results, resulting in the issue of speaker leakage, i.e., synthetic speech may have the voice identity of the source speaker rather than the target speaker. This paper proposes a new method with the aim to synthesize controllable emotional expressive speech and meanwhile maintain the target speakers identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework with emotion embedding as the conditioning variable to provide emotion information. Two emotion disentangling modules are contained in our method to 1) get speaker-irrelevant and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of synthetic speech to be that as expected. Moreover, we present an intuitive method to control the emotion strength in the synthetic speech for the target speaker. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which allows controlling the emotion strength conveyed by the embedding. Extensive experiments have been conducted on a Mandarin disjoint corpus, and the results demonstrate that the proposed method is able to synthesize reasonable emotional speech for the target speaker. Compared to the state-of-the-art reference embedding learned methods, our method gets the best performance on the cross-speaker emotion transfer task, indicating that our method achieves the new state-of-the-art performance on learning the speaker-irrelevant emotion embedding. Furthermore, the strength ranking test and pitch trajectories plots demonstrate that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.

Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis

METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis

iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis

Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation

Controllable Multi-Speaker Emotional Speech Synthesis with Emotion Representation of High Generalization Capability

Zero-shot Cross-lingual Voice Transfer for TTS

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

ET-GAN: Cross-Language Emotion Transfer Based on Cycle-Consistent Generative Adversarial Networks

Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

EMOTION CONTROLLABLE SPEECH SYNTHESIS USING EMOTION-UNLABELED DATASET WITH THE ASSISTANCE OF CROSS-DOMAIN SPEECH EMOTION RECOGNITION

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts