Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong, Haizhou Li
DOI: https://doi.org/10.21437/interspeech.2016-1053
2016-01-01
Abstract: Emotional voice conversion aims at converting speech from one emotional state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. Continuous wavelet transform (CWT) representations of the fundamental frequency (F0) and the energy contour are used for prosody modeling. Specifically, we use CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with a DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. The converted speech signals are evaluated both objectively and subjectively, and the results confirm the effectiveness of the proposed method.
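
To make the multi-scale prosody representation concrete, the following is a minimal sketch of a CWT decomposition of a contour into a fixed number of temporal scales, five for F0 and ten for energy as stated in the abstract. The Mexican hat mother wavelet, the octave spacing of the scales, the base scale of two frames, the mean-centering step, and the toy contours are all assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
import math


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet; an assumed choice of wavelet."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_decompose(contour, num_scales, base_scale=2.0):
    """Decompose a contour into num_scales rows, one per temporal scale.

    Assumption: scales are one octave apart (base_scale * 2**i frames).
    Each returned row has the same length as the input contour.
    """
    n = len(contour)
    mean = sum(contour) / n
    centered = [x - mean for x in contour]  # assumed normalization step
    rows = []
    for i in range(num_scales):
        scale = base_scale * 2 ** i
        row = []
        for t in range(n):
            # Direct (unoptimized) convolution with the scaled wavelet.
            acc = 0.0
            for x in range(n):
                acc += centered[x] * mexican_hat((x - t) / scale)
            row.append(acc / math.sqrt(scale))
        rows.append(row)
    return rows


# Toy contours (made up): a slow phrase-level movement plus a faster ripple.
frames = 120
log_f0 = [5.0 + 0.3 * math.sin(2 * math.pi * k / frames)
          + 0.05 * math.sin(2 * math.pi * k / 10) for k in range(frames)]
energy = [1.0 + 0.5 * math.cos(2 * math.pi * k / 40) for k in range(frames)]

f0_scales = cwt_decompose(log_f0, num_scales=5)       # five-scale F0 features
energy_scales = cwt_decompose(energy, num_scales=10)  # ten-scale energy features
```

Under these assumptions, row 0 holds the finest temporal scale and the last row the coarsest, so each frame is described by a 5-dimensional F0 vector and a 10-dimensional energy vector that could be fed to a sequence model alongside the spectral features.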