Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Runnan Li, Zhiyong Wu, Yishuang Ning, Lifa Sun, Helen Meng, Lianhong Cai
DOI: https://doi.org/10.21437/interspeech.2017-1122
2017-01-01
Abstract: Speaker identity in speech is largely characterized by the spectro-temporal structure of the spectrum. Although recent research has demonstrated the effectiveness of long short-term memory (LSTM) recurrent neural networks (RNNs) in voice conversion, traditional LSTM-RNN based approaches usually focus only on the temporal evolution of speech features. In this paper, we improve the conventional LSTM-RNN method for voice conversion by employing the two-dimensional time-frequency LSTM (TFLSTM) to model spectro-temporal warping along both the time and frequency axes. A multi-task learned structured output layer (SOL) is then adopted to capture the dependencies between spectral and pitch parameters for further improvement, with spectral parameter targets conditioned on the pitch parameter predictions. Experimental results show that the proposed approach outperforms conventional systems in speech quality and speaker similarity.
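The conditioning idea behind the structured output layer can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give layer sizes, activations, or loss weights, so the dimensions, the plain affine heads, the `pitch_weight` parameter, and all function names below are assumptions made for illustration. The TFLSTM itself is not modelled here; `h` simply stands in for one frame of hidden features produced by whatever network precedes the output layer.

```python
import random

def affine(x, W, b):
    """Return W x + b for a single frame; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def structured_output(h, W_pitch, b_pitch, W_spec, b_spec):
    """Hypothetical structured output layer for one frame.

    The pitch head sees only the hidden features h. The spectral head
    sees h concatenated with the pitch prediction, so the spectral
    output is conditioned on the predicted pitch.
    """
    pitch = affine(h, W_pitch, b_pitch)
    spec = affine(h + pitch, W_spec, b_spec)  # list concatenation [h; pitch]
    return pitch, spec

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multitask_loss(pitch, pitch_tgt, spec, spec_tgt, pitch_weight=1.0):
    """Assumed multi-task objective: spectral MSE plus weighted pitch MSE."""
    return mse(spec, spec_tgt) + pitch_weight * mse(pitch, pitch_tgt)

def rand_matrix(rows, cols, rng):
    """Small random weight matrix, used only to make the demo runnable."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed): 6 hidden units, 2 pitch parameters, 4 spectral parameters.
rng = random.Random(0)
H, P, S = 6, 2, 4
h = [rng.gauss(0.0, 1.0) for _ in range(H)]
W_pitch, b_pitch = rand_matrix(P, H, rng), [0.0] * P
W_spec, b_spec = rand_matrix(S, H + P, rng), [0.0] * S

pitch, spec = structured_output(h, W_pitch, b_pitch, W_spec, b_spec)

# Shifting the pitch head's bias changes the spectral output as well,
# which is the dependency the structured layer is meant to capture.
shifted_pitch, shifted_spec = structured_output(
    h, W_pitch, [1.0] * P, W_spec, b_spec
)
```

In this sketch both heads would be trained jointly through `multitask_loss`, so errors in the spectral prediction also propagate back through the pitch head.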