Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.

What problem does this paper attempt to address?

The paper primarily addresses two key issues in emotional speech synthesis:

1. How to achieve fine-grained control of emotional intensity.
2. How to maintain the speaker's information unchanged during emotion transfer.

To solve these problems, the authors propose an emotion transfer speech synthesis model named RSET (Remapping-based Sorting method for Emotion Transfer Speech Synthesis). The core contributions of this model are as follows:

1. **Fine-grained Emotional Intensity Perception and Control**: By introducing a remapping-based sorting method to model the relative emotional intensity within the same emotion category, fine-grained emotional intensity perception and control are achieved. This method can capture the differences in emotional intensity between different samples within the same emotion category.
2. **Information Decoupling and Speaker Information Consistency**: To improve the model's ability to separate speaker information and emotional information, the authors introduce a mutual information minimization mechanism and design a speaker consistency loss function to ensure that the generated speech retains the speaker's characteristics while maintaining the emotion.

Specifically, the solutions proposed in the paper include the following aspects:

- **Remapping-based Sorting Method**: First, a coarse-grained sorting process determines the relative positions between non-neutral and neutral emotion samples. On this basis, fine-grained emotional intensity modeling is achieved by quantifying and activating the distance between each sample and the average emotional intensity.
- **Emotional Intensity Controller**: This controller includes an emotional intensity extractor and a fusion module, which can extract emotional information from reference audio and generate the final fusion embedding vector based on manually set emotional intensity adjustment values to guide the speech synthesizer in generating speech with the desired emotional intensity.
- **Information Decoupling and Speaker Consistency**: By minimizing the mutual information between speaker information and emotional information and introducing a speaker consistency loss function, the model's ability to separate these two types of information is improved, and the consistency of speaker information in the generated speech is ensured.

Experimental results show that the RSET model outperforms existing emotion-controllable speech synthesis methods in terms of speech quality, speaker similarity, and emotional accuracy, indicating that the model can effectively achieve emotional intensity control while maintaining high speech quality and speaker characteristic consistency.
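To make the three components described above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`remap_intensity`, `fuse`, `speaker_consistency_loss`) are hypothetical, the sigmoid used to "activate" the distance from the class mean is an assumption (the summary does not name the activation), the linear interpolation in `fuse` is only one plausible way to apply a manually set intensity value, and the cosine-based loss is a common choice for speaker consistency rather than the confirmed formulation. The mutual information minimization term is not shown.

```python
import math


def remap_intensity(scores):
    """Map raw intensity scores of samples from ONE emotion class into (0, 1).

    Assumption: the distance of each sample to the class-average intensity
    is passed through a sigmoid; the source does not specify the activation.
    """
    mean = sum(scores) / len(scores)
    return [1.0 / (1.0 + math.exp(-(s - mean))) for s in scores]


def fuse(emotion_emb, neutral_emb, intensity):
    """Hypothetical fusion step: interpolate between neutral and emotional embeddings.

    intensity = 0.0 returns the neutral embedding, 1.0 the emotional one.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    return [n + intensity * (e - n) for e, n in zip(emotion_emb, neutral_emb)]


def speaker_consistency_loss(ref_spk, gen_spk):
    """Return 1 - cosine similarity between reference and generated speaker embeddings."""
    dot = sum(a * b for a, b in zip(ref_spk, gen_spk))
    norm = math.sqrt(sum(a * a for a in ref_spk)) * math.sqrt(sum(b * b for b in gen_spk))
    return 1.0 - dot / norm


if __name__ == "__main__":
    # Relative intensities preserve the ordering of the raw scores.
    print(remap_intensity([0.2, 0.5, 0.8]))
    # Half-strength emotion: halfway between the neutral and emotional embeddings.
    print(fuse([1.0, 2.0], [0.0, 0.0], 0.5))
    # Identical speaker embeddings give a loss close to zero.
    print(speaker_consistency_loss([1.0, 0.0], [1.0, 0.0]))
```

In this sketch the remapped values preserve the within-class ordering of the raw scores, which is the property the sorting method relies on, and the loss approaches zero when the generated speech keeps the reference speaker's embedding direction.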

RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis
