Abstract: This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e., utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to build a multi-speaker expressive text-to-speech (TTS) system that can synthesize speech with multiple styles and emotions for a target speaker. Specifically, it proposes a contrastive learning-based method for cross-speaker style and emotion transfer.

#### Main problems:

1. **Multi-speaker, multi-style, and multi-emotion speech synthesis**: Current TTS systems that generate speech with specific styles and emotions can often handle only a single speaker or a limited set of emotions and styles. This paper aims to cover multiple speakers and to generate speech flexibly in different styles and emotions.
2. **Disentangling speaker timbre, style, and emotion**: In speech, speaker timbre, style, and emotion are intertwined, which makes it difficult to control each attribute separately. For example, changing the emotion of an utterance may inadvertently change the speaker's timbre. A method is therefore needed to disentangle these attributes.
3. **Using multi-domain data to improve data utilization efficiency**: Training a TTS system that handles multiple styles and emotions requires a large amount of diverse data. Data labeled with styles and emotions are usually scarce, whereas unlabeled data are abundant, so making effective use of the unlabeled data is a further challenge.

#### Solutions:

- **Contrastive learning**: Positive and negative sample pairs are constructed at the utterance level and the category level to extract disentangled style, emotion, and speaker representations. The contrastive objective pulls representations of positive pairs together and pushes negative pairs apart, which keeps the three kinds of representation distinct from one another.
- **Semi-supervised training strategy**: A semi-supervised training strategy draws on multi-domain data, including style-labeled data, emotion-labeled data, and a large amount of unlabeled data, to improve data utilization efficiency. This makes use of the abundant unlabeled data and is intended to improve the robustness and generalization of the model.
- **Improved VITS model**: The learned disentangled representations are integrated into an improved VITS model to synthesize expressive multi-speaker, multi-style, and multi-emotion speech.

With these methods, the paper reports that the proposed framework can generate diverse expressive speech in multi-domain, multi-speaker scenarios, even when the training data contain no recordings of the target speaker in the desired style or emotion.

### Summary:

This paper addresses style and emotion transfer in multi-speaker expressive speech synthesis. Through contrastive learning and a semi-supervised training strategy, it disentangles style, emotion, and speaker timbre, which lets a target speaker's voice be synthesized in styles and emotions learned from other speakers.
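The two levels of contrastive learning and the role of unlabeled data can be illustrated with a small sketch. The summary above does not give the paper's actual loss functions or encoders, so everything here is an assumption made for illustration: an InfoNCE-style loss with cosine similarity, a single shared embedding space, a temperature of 0.1, and the hypothetical helper names `contrastive_loss`, `utterance_level_positives`, `category_level_positives`, and `combined_loss`. At the utterance level, positives are segments from the same utterance, which needs no labels; at the category level, positives share a style or emotion label, and unlabeled items are simply skipped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, positive_sets, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    For each anchor i, items in positive_sets[i] are pulled closer and all
    other items are pushed away. Anchors without positives are skipped.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        positives = positive_sets[i]
        if not positives:
            continue
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def utterance_level_positives(utterance_ids):
    """Positives are other segments cut from the same utterance (label-free)."""
    return [
        {j for j, other in enumerate(utterance_ids) if j != i and other == uid}
        for i, uid in enumerate(utterance_ids)
    ]


def category_level_positives(labels):
    """Positives share the anchor's label; unlabeled items (None) have none."""
    positives = []
    for i, label in enumerate(labels):
        if label is None:
            positives.append(set())
            continue
        positives.append(
            {j for j, other in enumerate(labels) if j != i and other == label}
        )
    return positives


def combined_loss(embeddings, utterance_ids, labels, temperature=0.1):
    """Sum of the utterance-level and category-level terms (equal weights assumed)."""
    utt = contrastive_loss(
        embeddings, utterance_level_positives(utterance_ids), temperature
    )
    cat = contrastive_loss(embeddings, category_level_positives(labels), temperature)
    return utt + cat


# Toy batch: two segments from each of three utterances.
# The last utterance is unlabeled, so it only feeds the utterance-level term.
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
utterance_ids = [0, 0, 1, 1, 2, 2]
emotion_labels = ["happy", "happy", "sad", "sad", None, None]

print("combined loss:", combined_loss(clustered, utterance_ids, emotion_labels))
```

In this toy batch the unlabeled utterance still shapes the embedding space through the utterance-level term, which is the intuition behind using unlabeled data alongside style-labeled and emotion-labeled data. A real system would presumably apply such losses to the outputs of separate style, emotion, and speaker encoders rather than to one shared space.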

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Disentangling Style and Speaker Attributes for TTS Style Transfer

Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis

Towards Multi-Scale Style Control for Expressive Speech Synthesis

Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition