EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

2023-06-01

Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model that is used to extract emotion embeddings. Mixed emotion synthesis is achieved by combining the noise predicted by the diffusion model conditioned on different emotions, within a single sampling process at run-time. We further mix Neutral with a specific primary emotion in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for mixed emotion synthesis and intensity control.

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two issues in emotional Text-To-Speech (TTS): mixed emotion synthesis and emotion intensity control. Existing emotional speech synthesis methods typically handle only a limited number of emotion types and perform poorly at intensity control. To overcome these limitations, the paper proposes EmoMix, a framework that combines a denoising diffusion probabilistic model (DDPM) with a pre-trained Speech Emotion Recognition (SER) model that extracts emotion embeddings, so that it can synthesize speech with a specified emotion intensity or a mixture of emotions.

The main contributions of EmoMix are:

1. **High-dimensional emotion embeddings**: A pre-trained SER model extracts high-dimensional emotion embeddings from reference audio, which lets the model handle both seen and unseen emotion categories.
2. **Runtime emotion mixing**: The noise predicted under different emotion conditions is combined during a single sampling process, so mixed emotions are generated without being modeled directly.
3. **Emotion intensity control**: Mixing Neutral with a target emotion in different proportions controls the emotion intensity of the synthesized speech.

Experimental results show that EmoMix can synthesize both single primary emotions and mixed emotions and can control emotion intensity while maintaining high-quality speech output. Compared with the baseline methods, EmoMix achieves better results in both subjective evaluations and objective metrics, and it performs well on unseen emotion categories.
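The runtime mixing and intensity control described above can be illustrated with a minimal sketch. The summary does not give the exact combination rule, so the normalized weighted average below is an assumption, as are the helper names `mix_predicted_noise` and `intensity_weights`, the emotion labels, and the tensor shape. Random arrays stand in for the noise a diffusion denoiser would predict under each emotion condition at one sampling step.

```python
import numpy as np


def mix_predicted_noise(noise_by_emotion, weights):
    """Combine per-emotion noise predictions with normalized weights.

    Hypothetical helper: a weighted average is assumed here, the source
    only says the predicted noises are "combined" during sampling.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    first = next(iter(noise_by_emotion.values()))
    mixed = np.zeros_like(first, dtype=float)
    for emotion, weight in weights.items():
        mixed += (weight / total) * noise_by_emotion[emotion]
    return mixed


def intensity_weights(emotion, intensity):
    """Express intensity as a Neutral/target mixture (assumed linear scale).

    intensity = 0.0 gives pure Neutral, intensity = 1.0 gives the pure target.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if emotion == "Neutral":
        return {"Neutral": 1.0}
    return {"Neutral": 1.0 - intensity, emotion: intensity}


# Toy stand-ins for the denoiser output at one sampling step,
# one array per emotion condition (shape is illustrative only).
rng = np.random.default_rng(0)
shape = (80, 16)
noise = {name: rng.standard_normal(shape) for name in ("Neutral", "Happy", "Surprise")}

# Mixed emotion: 60% Happy, 40% Surprise.
mixed = mix_predicted_noise(noise, {"Happy": 0.6, "Surprise": 0.4})

# Intensity control: Happy at half strength, blended with Neutral.
half_happy = mix_predicted_noise(noise, intensity_weights("Happy", 0.5))
```

In a real sampler this combination would be applied at the relevant denoising steps of a single reverse pass, with the mixed noise used in place of a single-condition prediction; the sketch covers only the combination itself.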


EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

Speech Synthesis with Mixed Emotions

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization

iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre

Hierarchical Control of Emotion Rendering in Speech Synthesis

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Emo-Tts: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Controllable Multi-Speaker Emotional Speech Synthesis with Emotion Representation of High Generalization Capability

Prosody Analysis And Modeling For Emotional Speech Synthesis

RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

Prosody Conversion from Neutral Speech to Emotional Speech.

Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis

ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

Emotional speech synthesis with rich and granularized control