Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Zhengqi Wen, Kehuang Li, Jianhua Tao, Chin-Hui Lee
DOI: https://doi.org/10.1109/apsipa.2016.7820716
2016-01-01
Abstract: We propose a voice conversion framework that maps the speech features of a source speaker to those of a target speaker using deep neural networks (DNNs). Because the parallel data available for a given pair of source and target speakers is limited, speech synthesis and dynamic time warping are used to construct a large parallel corpus for DNN training. Even when trained on a small corpus, the DNNs achieve lower log spectral distortion than the conventional Gaussian mixture model (GMM) approach trained on the same data. In preference tests, DNN-converted speech trained on the large synthesized parallel corpus is preferred over DNN-converted speech trained on the small parallel corpus, with a speech naturalness preference score of about 54.5% vs. 32.8% and a speech similarity preference score of about 52.5% vs. 23.6%.
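The abstract names dynamic time warping as the step that turns two utterances of the same text into frame-level training pairs. The following is a minimal sketch of that alignment, not the authors' implementation: the function name `dtw_align`, the Euclidean frame distance, and the unconstrained step pattern are all assumptions, since the abstract does not specify the features or the DTW variant used.

```python
import numpy as np


def dtw_align(source, target):
    """Align two feature sequences with basic dynamic time warping.

    source: array of shape (n, d), target: array of shape (m, d).
    Returns a list of (source_index, target_index) pairs.
    Hypothetical helper for illustration only.
    """
    n, m = len(source), len(target)
    # Pairwise Euclidean distance between every source and target frame
    # (assumed distance measure).
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)

    # cost[i, j] is the minimal accumulated distance for aligning
    # source[:i] with target[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # advance both sequences
                cost[i - 1, j],      # advance source only
                cost[i, j - 1],      # advance target only
            )

    # Backtrack from the final cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


# Toy one-dimensional "features": the target repeats its first frame.
source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [0.0], [1.0], [2.0]])
path = dtw_align(source, target)
```

Each pair in `path` gives one source frame and the target frame it is matched to, so indexing both sequences by the path yields equal-length input and output sequences of the kind a frame-wise DNN mapping would be trained on.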
What problem does this paper attempt to address?