Abstract: Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
What problem does this paper attempt to address?
The paper addresses the difficulty of quantitatively controlling multi-level emotion rendering in emotional text-to-speech (TTS). Traditional emotional TTS systems usually treat emotion as a global attribute of the utterance, which makes fine-grained control of emotion intensity difficult, especially at distinct levels such as phonemes, words, and utterances.
### Main Problems
1. **Multi-level Control of Emotion Rendering**: Existing emotional TTS systems often provide only an averaged emotional style and lack flexible control over emotion intensity. As a result, the emotional expression in the generated speech is less nuanced and less natural.
2. **Quantitative Modeling of Emotion Intensity**: Emotion intensity needs to be controlled in a quantitative manner so that users can adjust emotional expression at each level.
3. **Capturing Multi-level Emotion Distribution**: A method is needed to capture and represent the emotion distribution in different speech segments (phonemes, words, utterances) so that finer-grained emotion control becomes possible.
### Solutions
To address these problems, the authors propose a diffusion-based emotional TTS framework with the following contributions:
1. **Hierarchical Emotion Distribution (ED) Extractor**: This module extracts quantifiable emotion distribution embeddings (ED embeddings) at different levels (phonemes, words, utterances), enabling fine-grained control of emotion rendering.
2. **Emotion Intensity Modeling**: The authors explore various acoustic features and assess their impact on emotion intensity modeling, and propose a new emotion intensity modeling approach intended to improve control over emotion intensity.
3. **Learning Emotional Information during Training**: During TTS training, the hierarchical ED embedding captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information, which allows quantitative control of emotion rendering at inference time.
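To make the idea of an emotion distribution at several segment levels concrete, here is a minimal sketch. It is not the paper's extractor: it assumes that per-frame emotion scores are already available from some classifier, and it simply averages them over phoneme, word, and utterance spans. The function names, the `(start, end)` span format, and the toy data are all hypothetical.

```python
def segment_mean(frame_scores, segments):
    """Average per-frame emotion scores within each (start, end) frame span.

    frame_scores: list of per-frame score lists, one value per emotion class.
    segments: list of (start, end) frame indices, end exclusive (assumed format).
    """
    pooled = []
    for start, end in segments:
        span = frame_scores[start:end]
        if not span:
            raise ValueError(f"empty segment: ({start}, {end})")
        num_classes = len(span[0])
        pooled.append(
            [sum(frame[c] for frame in span) / len(span) for c in range(num_classes)]
        )
    return pooled


def hierarchical_ed(frame_scores, phoneme_spans, word_spans):
    """Collect pooled emotion scores at phoneme, word, and utterance level."""
    num_frames = len(frame_scores)
    return {
        "phoneme": segment_mean(frame_scores, phoneme_spans),
        "word": segment_mean(frame_scores, word_spans),
        "utterance": segment_mean(frame_scores, [(0, num_frames)])[0],
    }


# Toy data: 4 frames, 2 emotion classes (illustrative values only).
frame_scores = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
phoneme_spans = [(0, 1), (1, 2), (2, 4)]
word_spans = [(0, 2), (2, 4)]

ed = hierarchical_ed(frame_scores, phoneme_spans, word_spans)
for level, values in ed.items():
    print(level, values)
```

In this sketch, changing the scores of a single word span changes only that word's entry and the utterance-level average, which illustrates why segment-level representations permit more localized control than a single global emotion label.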
### Experimental Verification
To verify the effectiveness of the proposed method, the authors conducted both objective and subjective evaluations. According to the paper, the framework performs well in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
### Formula Representation
The formulas involved in the paper, such as the modified Softmax function for calculating emotion intensity, can be represented as:
\[ s(z_i)=\frac{e^{\alpha z_i}}{\sum_{j = 0}^{K - 1}e^{\alpha z_j}}\]
where \(\alpha\) is a constant value that controls the entropy of the softmax distribution, \(i, j\in\{0, 1,\cdots, K - 1\}\) are the indices of the output-layer nodes, and \(K\) is the number of classes (emotions).
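The effect of \(\alpha\) on entropy can be checked numerically. The sketch below assumes the standard exponential form of the softmax, with each logit multiplied by \(\alpha\) before normalization. The function names and the example logits are illustrative and not taken from the paper.

```python
import math


def scaled_softmax(logits, alpha=1.0):
    """Softmax over logits multiplied by alpha (exponential form assumed)."""
    scaled = [alpha * z for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # illustrative output-layer values for K = 3 emotions
for alpha in (0.5, 1.0, 2.0):
    probs = scaled_softmax(logits, alpha)
    print(alpha, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

With \(\alpha = 0\) every class receives probability \(1/K\), the maximum-entropy case. Increasing \(\alpha\) concentrates probability on the largest logit and lowers the entropy.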
With these contributions, the paper addresses quantitative control of multi-level emotion rendering in emotional TTS and offers an approach toward more natural and flexible emotional speech synthesis.