Abstract: Emotional voice conversion (EVC) aims to convert a speaker's voice from one emotional state to another without changing the speaker identity or the spoken content. Early work on emotional voice conversion found it difficult to handle even simple fundamental frequency (F0) features, although F0 is one of the most important features in emotional voice expression. Discrete F0 features are generally processed with a linear conversion method, which leads to poor results for emotional voice conversion methods that effectively rely only on the spectral (SP) features. In this study, we propose an emotional voice conversion system that uses an F0 feature conversion method based on neural network (NN) training of multi-dimensional log F0 features. This method processes the F0 feature effectively and achieves better emotional conversion. Meanwhile, the system uses a deep bidirectional long short-term memory (DBiLSTM) network to train on the SP features and learn the context of the voice spectrum. Extracting the SP features helps to capture and reconstruct the timbre of the speech signal. In preprocessing, an improved Dynamic Time Warping (DTW) algorithm is used to improve the accuracy of speech frame alignment and further increase the quality of emotional voice conversion. With these methods, the SP and F0 features of emotional voice can be converted at the same time. The experimental results show that the system performs well on emotional voice conversion.
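
For context, the linear F0 conversion that the abstract describes as the usual baseline is commonly implemented as a mean and variance transformation in the log F0 domain. The sketch below illustrates that baseline only, not the NN-based method proposed in the paper. The function names (`log_f0_stats`, `convert_f0_linear`) and the toy F0 tracks are hypothetical, and it assumes unvoiced frames are marked with F0 = 0 and that the source track has non-constant voiced F0.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_track):
    """Return (mean, std) of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_track if f > 0]
    return mean(logs), pstdev(logs)


def convert_f0_linear(f0_track, src_stats, tgt_stats):
    """Linearly map log F0 from source-emotion to target-emotion statistics.

    Unvoiced frames (F0 == 0, an assumed convention) are left at 0.
    Assumes the source standard deviation is non-zero.
    """
    mu_src, sd_src = src_stats
    mu_tgt, sd_tgt = tgt_stats
    converted = []
    for f in f0_track:
        if f <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
        else:
            z = (math.log(f) - mu_src) / sd_src  # normalise in log domain
            converted.append(math.exp(z * sd_tgt + mu_tgt))
    return converted


# Toy F0 tracks in Hz (illustrative values, not data from the paper).
neutral = [0.0, 110.0, 120.0, 130.0, 0.0, 125.0]
angry = [0.0, 180.0, 220.0, 260.0, 240.0, 0.0]

converted = convert_f0_linear(neutral, log_f0_stats(neutral), log_f0_stats(angry))
```

Because the mapping only shifts and scales the log F0 contour, the converted track keeps the shape of the source contour and matches the target emotion only in its global mean and spread. This limitation is what motivates learning the F0 mapping with a neural network instead.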

Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Self-attention Transfer Networks for Speech Emotion Recognition

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Toward Any-to-Any Emotion Voice Conversion using Disentangled Diffusion Framework

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

One-shot Emotional Voice Conversion Based on Feature Separation

In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis

SEEN AND UNSEEN EMOTIONAL STYLE TRANSFER FOR VOICE CONVERSION WITH A NEW EMOTIONAL SPEECH DATASET

Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks

Nonparallel Emotional Speech Conversion

Disentanglement Network: Disentangle the Emotional Features from Acoustic Features for Speech Emotion Recognition

Emotional voice conversion using DBiLSTM-NN with MFCC and LogF0 features

Emotion Intensity and its Control for Emotional Voice Conversion

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Emotional Voice Conversion With Cycle-consistent Adversarial Network

EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach.

Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features