Toward Any-to-Any Emotion Voice Conversion using Disentangled Diffusion Framework

Hsing-Hang Chou, Yun-Shao Lin, Ching-Chin Sung, Yu Tsao, Chi-Chun Lee
2024-09-20
Abstract: Emotional Voice Conversion (EVC) aims to modify the emotional expression of speech for various applications, such as human-machine interaction. Previous deep learning-based approaches using generative adversarial networks and autoencoder models have shown promise but suffer from quality degradation and limited emotion control. To address these issues, a novel diffusion-based EVC framework with disentangled loss and expressive guidance is proposed. Our method separates speaker and emotional features to maintain speech quality while enhancing emotional expressiveness. Tested on both real-world (in-the-wild) and acted-out datasets, the approach achieved significant improvements in emotion classification accuracy and showed reduced distortion compared to state-of-the-art models.
Audio and Speech Processing