RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao
2024-05-27
Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper primarily addresses two key issues in emotional speech synthesis:

1. How to achieve fine-grained control of emotional intensity.
2. How to keep the speaker's information unchanged during emotion transfer.

To solve these problems, the authors propose an emotion transfer speech synthesis model named RSET (Remapping-based Sorting method for Emotion Transfer Speech Synthesis). The core contributions of this model are as follows:

1. **Fine-grained Emotional Intensity Perception and Control**: By introducing a remapping-based sorting method to model the relative emotional intensity within the same emotion category, fine-grained emotional intensity perception and control are achieved. This method can capture the differences in emotional intensity between different samples within the same emotion category.
2. **Information Decoupling and Speaker Information Consistency**: To improve the model's ability to separate speaker information and emotional information, the authors introduce a mutual information minimization mechanism and design a speaker consistency loss function to ensure that the generated speech retains the speaker's characteristics while maintaining the emotion.

Specifically, the solutions proposed in the paper include the following aspects:

- **Remapping-based Sorting Method**: First, a coarse-grained sorting process determines the relative positions between non-neutral and neutral emotion samples. On this basis, fine-grained emotional intensity modeling is achieved by quantifying and activating the distance between each sample and the average emotional intensity.
- **Emotional Intensity Controller**: This controller includes an emotional intensity extractor and a fusion module, which can extract emotional information from reference audio and generate the final fusion embedding vector based on manually set emotional intensity adjustment values to guide the speech synthesizer in generating speech with the desired emotional intensity.
- **Information Decoupling and Speaker Consistency**: By minimizing the mutual information between speaker information and emotional information and introducing a speaker consistency loss function, the model's ability to separate these two types of information is improved, and the consistency of speaker information in the generated speech is ensured.

Experimental results show that the RSET model outperforms existing emotion-controllable speech synthesis methods in terms of speech quality, speaker similarity, and emotional accuracy, indicating that the model can effectively achieve emotional intensity control while maintaining high speech quality and speaker characteristic consistency.
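
This summary gives no formulas for the fine-grained remapping step or the fusion module, so the following is only a minimal sketch of how such steps might look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `remap_intensities` and `fuse`, the `num_levels` parameter, the use of the class standard deviation for scaling, the sigmoid as the "activation", and the additive fusion rule.

```python
import math


def remap_intensities(scores, num_levels=5):
    """Map raw intensity scores of ONE emotion class to values in (0, 1).

    Hypothetical sketch: measure each sample's signed distance to the class
    mean, quantise that distance, then squash it with a sigmoid.
    """
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all scores are equal
    remapped = []
    for s in scores:
        z = (s - mean) / std                       # signed distance to the mean
        z_q = round(z * num_levels) / num_levels   # quantise in steps of 1/num_levels
        remapped.append(1.0 / (1.0 + math.exp(-z_q)))  # activation into (0, 1)
    return remapped


def fuse(speaker_emb, emotion_emb, alpha):
    """Assumed fusion rule: add the emotion embedding scaled by a user-set alpha."""
    if len(speaker_emb) != len(emotion_emb):
        raise ValueError("embeddings must have the same dimension")
    return [s + alpha * e for s, e in zip(speaker_emb, emotion_emb)]


if __name__ == "__main__":
    # Toy scores for three samples of the same emotion class (made-up values).
    print(remap_intensities([0.1, 0.5, 0.9]))
    # alpha = 0.0 leaves the speaker embedding untouched in this sketch.
    print(fuse([1.0, 2.0], [0.5, -1.0], 0.0))
```

Under these assumptions, a sample at the class mean maps to 0.5, stronger samples map above it, and weaker ones below it, so relative order within the class is preserved. The fusion rule is deliberately the simplest choice that lets a manually set `alpha` scale the emotional contribution.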