Abstract: The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During emotion transfer, the identity information of the source speaker can also affect the synthesized result, causing speaker leakage, i.e., the synthetic speech may carry the voice identity of the source speaker rather than that of the target speaker. This paper proposes a new method that aims to synthesize controllable, emotionally expressive speech while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework that uses an emotion embedding as the conditioning variable to provide emotion information. The method contains two emotion disentangling modules, which 1) obtain a speaker-irrelevant and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of the synthetic speech to match the expected ones. Moreover, we present an intuitive method to control the emotion strength in the synthetic speech for the target speaker. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which allows controlling the emotion strength conveyed by the embedding. Extensive experiments have been conducted on a Mandarin disjoint corpus, and the results demonstrate that the proposed method is able to synthesize reasonable emotional speech for the target speaker. Compared to state-of-the-art methods that learn a reference embedding, our method achieves the best performance on the cross-speaker emotion transfer task, indicating that it sets a new state of the art in learning speaker-irrelevant emotion embeddings. Furthermore, the strength ranking test and pitch trajectory plots demonstrate that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.
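The strength control described in the abstract, adjusting the learned emotion embedding with a scalar, can be illustrated with a minimal NumPy sketch. The abstract does not specify how the scaled embedding is fed to the Tacotron2-based model, so the concatenation with encoder outputs below is an assumption, and the names `scale_emotion_embedding` and `condition_encoder_outputs`, the non-negativity check, and all dimensions are hypothetical choices made only for illustration.

```python
import numpy as np


def scale_emotion_embedding(emotion_embedding, strength):
    """Scale an emotion embedding by a scalar strength value.

    Hypothetical helper: strength 0 removes the emotion signal, 1 keeps the
    learned embedding unchanged, and values above 1 amplify it.
    """
    if strength < 0:
        # Assumption: only non-negative strengths are meaningful here.
        raise ValueError("strength must be non-negative")
    return float(strength) * np.asarray(emotion_embedding, dtype=np.float64)


def condition_encoder_outputs(encoder_outputs, speaker_embedding,
                              emotion_embedding, strength=1.0):
    """Attach speaker and scaled emotion embeddings to every encoder step.

    Assumption: conditioning is done by concatenation along the feature
    axis; the abstract only says the emotion embedding is the conditioning
    variable and does not describe the mechanism.
    """
    encoder_outputs = np.asarray(encoder_outputs, dtype=np.float64)
    num_steps = encoder_outputs.shape[0]
    speaker = np.asarray(speaker_embedding, dtype=np.float64)
    scaled_emotion = scale_emotion_embedding(emotion_embedding, strength)
    # Repeat the utterance-level vectors once per encoder time step.
    tiled_speaker = np.tile(speaker, (num_steps, 1))
    tiled_emotion = np.tile(scaled_emotion, (num_steps, 1))
    return np.concatenate(
        [encoder_outputs, tiled_speaker, tiled_emotion], axis=1
    )


if __name__ == "__main__":
    # Toy dimensions chosen only for illustration.
    text_states = np.zeros((5, 8))          # 5 encoder steps, 8 features
    target_speaker = np.ones(4)             # target speaker embedding
    emotion = np.array([1.0, -2.0, 0.5])    # learned emotion embedding
    for strength in (0.0, 1.0, 2.0):
        conditioned = condition_encoder_outputs(
            text_states, target_speaker, emotion, strength
        )
        print(strength, conditioned.shape, conditioned[0, -3:])
```

In this sketch the speaker embedding is left untouched and only the emotion vector is scaled, mirroring the abstract's goal of varying emotion strength while keeping the target speaker's identity fixed.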
