Abstract: Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a diffusion-based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.

What problem does this paper attempt to address?

The paper addresses the difficulty of quantitatively controlling multi-level emotion rendering in emotional text-to-speech (TTS) synthesis. Traditional emotional TTS systems usually treat emotion as a global attribute of the utterance, which makes fine-grained control of emotion intensity difficult, especially across levels such as phonemes, words, and utterances.

### Main Problems

1. **Multi-level Control of Emotion Rendering**: Existing emotional TTS systems often provide only an averaged emotional style and lack flexible control of emotion intensity. As a result, the emotional expression of the generated speech is less nuanced and less natural.
2. **Quantitative Modeling of Emotion Intensity**: Users need a way to set emotion intensity numerically so that emotional expression can be adjusted at each level.
3. **Capturing of Multi-level Emotion Distribution**: A method is needed to capture and represent the emotion distribution of different speech segments (phonemes, words, utterances), which is what makes finer-grained emotion control possible.

### Solutions

To address these problems, the authors propose a diffusion-based emotional TTS framework with the following contributions:

1. **Hierarchical Emotion Distribution (ED) Extraction Module**: This module extracts quantifiable emotion distribution embeddings (ED embeddings) at the phoneme, word, and utterance levels, which enables fine-grained control of emotion rendering.
2. **Emotion Intensity Modeling**: The authors explore various acoustic features, assess their impact on emotion intensity modeling, and propose a new approach to modeling emotion intensity.
3. **Learning of Emotional Information during Training**: During training, the hierarchical ED embedding captures the variance in emotion intensity in the reference audio and correlates it with linguistic and speaker information. This allows quantitative control of emotion rendering at inference time.

### Experimental Verification

The authors evaluate the proposed method with both objective and subjective experiments. The results indicate that the framework is effective in terms of speech quality, emotional expressiveness, and hierarchical emotion control.

### Formula Representation

The paper uses a modified softmax function to calculate emotion intensity:

\[ s(z_i)=\frac{e^{\alpha z_i}}{\sum_{j = 0}^{K - 1} e^{\alpha z_j}} \]

where \(\alpha\) is a constant that controls the entropy of the softmax distribution, \(i, j\in\{0, 1,\cdots, K - 1\}\) are the indices of the output-layer nodes, and \(K\) is the number of classes (emotions).

With these contributions, the paper addresses the quantitative control of multi-level emotion rendering in emotional TTS and offers an approach toward more natural and flexible emotional speech synthesis.
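To illustrate how a constant \(\alpha\) can control the entropy of a softmax distribution, here is a minimal Python sketch. It assumes the standard exponential form of softmax with each logit multiplied by \(\alpha\). The function names `scaled_softmax` and `entropy` and the example logits are hypothetical and are not taken from the paper's implementation.

```python
import math
from typing import List


def scaled_softmax(logits: List[float], alpha: float = 1.0) -> List[float]:
    """Softmax over `logits` after multiplying each logit by `alpha`."""
    scaled = [alpha * z for z in logits]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical classifier outputs for K = 4 emotion classes (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

for alpha in (0.5, 1.0, 2.0):
    probs = scaled_softmax(logits, alpha)
    print(alpha, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

In this sketch a smaller \(\alpha\) flattens the distribution (higher entropy), while a larger \(\alpha\) concentrates the probability mass on the largest logit (lower entropy). With \(\alpha = 0\) the output is uniform over the \(K\) classes.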

Hierarchical Control of Emotion Rendering in Speech Synthesis

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis

Fine-Grained Quantitative Emotion Editing for Speech Generation

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

Emotional speech synthesis with rich and granularized control

EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis

Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity