The IFLYTEK System for Blizzard Machine Learning Challenge 2017-ES1

Li-Juan Liu, Chuang Ding, Ya-Jun Hu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou, Si Wei
DOI: https://doi.org/10.1109/asru.2017.8268999
2017-01-01
Abstract: This paper introduces the speech synthesis system submitted by IFLYTEK for the Blizzard Machine Learning Challenge 2017-ES1. Linguistic and acoustic features from a 4-hour corpus were released for this task, and participants were expected to build a speech synthesis system on the given linguistic and acoustic features without using any external data. Our system is composed of a long short-term memory (LSTM) recurrent neural network (RNN)-based acoustic model and a generative adversarial network (GAN)-based post-filter for mel-cepstra. Two approaches to building the GAN-based post-filter are implemented and compared in our experiments. The first predicts the residuals of mel-cepstra given the mel-cepstra predicted by the LSTM-based acoustic model. However, this method led to unstable synthetic speech in our experiments, which may be due to the poor quality of analysis-synthesis speech produced from the natural acoustic features given by this corpus. The second discards the detailed components of natural mel-cepstra through dimension reduction with principal component analysis (PCA) and then recovers them with a GAN given the main PCA components. At synthesis time, mel-cepstra predicted by the RNN acoustic model are first projected onto the main PCA components, which are then sent to the GAN for detail recovery. The second approach is used in the final submitted system. The evaluation results show the effectiveness of our submitted system.
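The PCA projection step in the second approach can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimension (40), the number of main components (10), the random stand-in data, and the helper names `fit_pca`, `project`, and `reconstruct` are all assumptions made for illustration. The GAN that recovers the detailed components is omitted; the sketch only shows how predicted mel-cepstra might be reduced to main PCA components fitted on natural features.

```python
import numpy as np

def fit_pca(frames, n_main):
    """Fit PCA on (n_frames, dim) features; return the mean and top n_main components."""
    mean = frames.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_main]

def project(frames, mean, components):
    """Map features to their coordinates on the main PCA components."""
    return (frames - mean) @ components.T

def reconstruct(codes, mean, components):
    """Map main-component coordinates back to the feature space (details lost)."""
    return codes @ components + mean

# Random stand-ins (assumption): real inputs would be mel-cepstral frames.
rng = np.random.default_rng(0)
natural = rng.standard_normal((500, 40))                     # "natural" mel-cepstra
predicted = natural + 0.1 * rng.standard_normal((500, 40))   # "acoustic model" output

mean, comps = fit_pca(natural, n_main=10)
codes = project(predicted, mean, comps)    # would be the input to the GAN post-filter
smooth = reconstruct(codes, mean, comps)   # detail-free version, for comparison only
```

In this sketch, `codes` plays the role of the main PCA components that the abstract describes as being sent to the GAN; how the GAN is conditioned on them is not specified here.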