EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
2023-06-01
Abstract: There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed emotion synthesis is achieved by combining the noises predicted by the diffusion model conditioned on different emotions within a single sampling process at run-time. We further mix Neutral with a specific primary emotion in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotions and for intensity control.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two key issues in emotional Text-To-Speech (TTS) synthesis: mixed emotion synthesis and emotion intensity control. It proposes a new framework called EmoMix. Existing emotional speech synthesis methods usually handle only a limited number of emotion types and perform poorly at emotion intensity control. To overcome these limitations, EmoMix builds on a denoising diffusion probabilistic model (DDPM) and uses a pre-trained Speech Emotion Recognition (SER) model to extract emotion embeddings, which lets it synthesize speech with a specified emotion intensity or a mixture of emotions.

The main contributions of EmoMix are:

1. **High-dimensional emotion embeddings**: A pre-trained SER model extracts high-dimensional emotion embeddings from reference audio, which can handle both known and unknown emotion categories.
2. **Runtime emotion mixing**: The noises predicted under different emotion conditions are combined during the sampling process, so mixed emotions are generated without modeling mixed emotions directly.
3. **Emotion intensity control**: Mixing the neutral emotion with a target emotion in different proportions controls the emotion intensity of the synthesized speech.

Experimental results show that EmoMix can synthesize both single primary emotions and mixed emotions, and can control emotion intensity while maintaining speech quality. Compared with the baseline methods, EmoMix achieves better results in both subjective evaluations and objective metrics, and it performs well on unseen emotion categories.
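The runtime mixing and intensity control described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the mixed noise is a weighted sum of the per-emotion noise predictions with weights normalized to sum to 1, and that intensity is a value in [0, 1] giving the share of the target emotion against Neutral. The names `mix_predicted_noise`, `intensity_weights`, and `toy_predictor` are hypothetical, and the toy predictor merely stands in for a trained diffusion network.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical signature: (noisy sample x_t, timestep t, emotion embedding) -> predicted noise
NoisePredictor = Callable[[Vector, int, Vector], Vector]


def mix_predicted_noise(
    predict_noise: NoisePredictor,
    x_t: Vector,
    t: int,
    emotion_embeddings: Sequence[Vector],
    weights: Sequence[float],
) -> Vector:
    """Combine the noise predicted under each emotion condition at one sampling step.

    Assumption: a weighted sum with weights normalized to sum to 1.
    """
    if not weights or len(emotion_embeddings) != len(weights):
        raise ValueError("need one weight per emotion embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    mixed = [0.0] * len(x_t)
    for embedding, weight in zip(emotion_embeddings, weights):
        # One forward pass of the (stand-in) network per emotion condition.
        eps = predict_noise(x_t, t, embedding)
        for i, value in enumerate(eps):
            mixed[i] += (weight / total) * value
    return mixed


def intensity_weights(intensity: float) -> Tuple[float, float]:
    """Return (neutral_weight, target_weight) for an intensity in [0, 1]."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return (1.0 - intensity, intensity)


def toy_predictor(x_t: Vector, t: int, embedding: Vector) -> Vector:
    """Stand-in for a trained network: shifts x_t by the mean of the embedding."""
    shift = sum(embedding) / len(embedding)
    return [value + shift for value in x_t]


if __name__ == "__main__":
    neutral = [0.0, 0.0]  # placeholder embeddings, not real SER outputs
    happy = [2.0, 2.0]
    weights = intensity_weights(0.25)
    print(mix_predicted_noise(toy_predictor, [0.0, 1.0], 10, [neutral, happy], weights))
```

In an actual sampler, the mixed noise would replace the single-condition prediction inside each reverse diffusion step, so only one sampling pass is needed; the extra cost is one additional network evaluation per mixed emotion at each step.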