Emotional voice conversion using DBiLSTM-NN with MFCC and LogF0 features

Cao, Danyang
DOI: https://doi.org/10.1007/s11042-024-19334-1
IF: 2.577
2024-05-17
Multimedia Tools and Applications
Abstract: Emotional voice conversion (EVC) aims to convert a speaker's voice from one emotional state to another without changing the speaker identity or the spoken content. Early emotional voice conversion methods struggled to handle the fundamental frequency (F0), which is one of the most important features in emotional voice expression. The discrete F0 features are generally processed with a linear conversion method, which leads to poor results for emotional voice conversion methods that otherwise rely only on the spectral (SP) features. In this study, we propose an emotional voice conversion system that uses an F0 conversion method based on neural network (NN) training of multi-dimensional log F0 features. This method processes the F0 feature effectively and achieves better emotional conversion. Meanwhile, the system uses a deep bidirectional long short-term memory (DBiLSTM) network to train the SP features and learn the context of the voice spectrum. The SP features capture the timbre of the speech signal and allow it to be reconstructed. In preprocessing, an improved Dynamic Time Warping (DTW) algorithm is used to improve the accuracy of speech frame alignment and further increase the quality of emotional voice conversion. With these methods, the SP and F0 features of emotional voice can be converted at the same time. The experimental results show that the system performs well on emotional voice conversion.
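For context, the linear F0 conversion that the abstract contrasts against is commonly implemented as a mean-variance transformation in the log F0 domain. The following is a minimal sketch of that baseline only, not of the NN-based method proposed in the paper. The function names, the toy F0 values, and the convention that unvoiced frames are marked with F0 = 0 are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def logf0_stats(f0_frames):
    """Return (mean, std) of log F0 over voiced frames (assumed: F0 > 0)."""
    log_f0 = [math.log(f) for f in f0_frames if f > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0_linear(f0_source, src_stats, tgt_stats):
    """Shift and scale source log F0 to match the target emotion's statistics."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    converted = []
    for f in f0_source:
        if f <= 0:
            converted.append(0.0)  # keep unvoiced frames unvoiced
        else:
            z = (math.log(f) - mu_s) / sigma_s  # normalize in the log domain
            converted.append(math.exp(mu_t + sigma_t * z))
    return converted


# Hypothetical per-frame F0 contours in Hz (0.0 marks an unvoiced frame).
neutral_f0 = [0.0, 110.0, 120.0, 130.0, 0.0, 125.0]
angry_f0 = [0.0, 180.0, 220.0, 260.0, 240.0, 0.0]

src = logf0_stats(neutral_f0)
tgt = logf0_stats(angry_f0)
converted = convert_f0_linear(neutral_f0, src, tgt)
```

Because this mapping only matches the global mean and variance of log F0, it cannot reshape the local contour of an utterance, which is one plausible reading of why the abstract describes the linear approach as limited.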
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering