Abstract: Emotional voice conversion (EVC) aims to convert a speaker's voice from one emotional state to another without changing the speaker identity or the spoken content. Early emotional voice conversion methods struggled with the simple fundamental frequency (F0) feature, although F0 is one of the most important features in emotional voice expression. Discrete F0 features are generally converted with a linear method, so emotional voice conversion methods that rely only on the spectral (SP) features perform poorly. In this study, we propose an emotional voice conversion system that converts F0 by training a neural network (NN) on multi-dimensional log F0 features. This method processes the F0 feature effectively and achieves better emotional conversion. The system also uses a deep bidirectional long short-term memory (DBiLSTM) network to train the SP features, so that it learns the context of the voice spectrum. Extracting SP features makes it possible to model and reconstruct the timbre of speech signals. In preprocessing, an improved Dynamic Time Warping (DTW) algorithm increases the accuracy of speech frame alignment and further improves the quality of the converted emotional voice. Together, these methods allow the SP and F0 features of emotional voice to be converted at the same time. Experimental results show that the system converts emotional voice effectively.
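To make the frame-alignment step concrete, the following is a minimal sketch of classic DTW over two sequences of feature frames. It is not the improved DTW variant the abstract refers to, whose details are not given here; the function name `dtw_align`, the Euclidean frame distance, and the toy one-dimensional frames are all assumptions chosen for illustration.

```python
import math


def dtw_align(source, target):
    """Classic DTW: return (total_cost, path) for two non-empty frame sequences.

    Each frame is a list of floats of equal length (e.g. one spectral vector).
    The path is a list of (source_index, target_index) pairs.
    """
    n, m = len(source), len(target)

    def dist(a, b):
        # Euclidean distance between two frames (an assumed local cost).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the frame-to-frame alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal: advance both
            (cost[i - 1][j], i - 1, j),          # advance source only
            (cost[i][j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


# Toy example: the target repeats its middle frame once.
src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [1.0], [1.0], [2.0]]
total, path = dtw_align(src, tgt)
```

In a parallel-data conversion pipeline, the aligned index pairs would typically be used to build frame-level source and target training pairs before a network is trained on them.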

One-shot Emotional Voice Conversion Based on Feature Separation

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

Emotion-State conversion for speaker recognition

Multi-Target Emotional Voice Conversion With Neural Vocoders

Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks

Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

Emotion Intensity and its Control for Emotional Voice Conversion

Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing

Emotional Voice Conversion With Cycle-consistent Adversarial Network

An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

Toward Any-to-Any Emotion Voice Conversion using Disentangled Diffusion Framework

Natural-Emotion Gmm Transformation Algorithm For Emotional Speaker Recognition

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Emotional voice conversion using DBiLSTM-NN with MFCC and LogF0 features

Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations

Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

One-Shot Voice Conversion with Global Speaker Embeddings