The USTC System for Blizzard Machine Learning Challenge 2017-ES2

Ya-Jun Hu, Li-Juan Liu, Chuang Ding, Zhen-Hua Ling, Li-Rong Dai
DOI: https://doi.org/10.1109/asru.2017.8268998
2017-01-01
Abstract: The Blizzard Machine Learning Challenge (BMLC) aims to liberate participants from speech-specific processing when building speech synthesis systems. This paper describes the USTC system for the ES2 sub-task in BMLC2017, which requires participants to train a model that directly predicts waveforms from linguistic features. We investigate three aspects of waveform modeling when preparing our system for this task. First, two model structures for waveform modeling, WaveNet and SampleRNN, are compared on this task. Second, a strategy of using features extracted from waveforms as intermediate representations for waveform modeling is studied. Experimental results show that using low-level features (STFT amplitude spectra) as intermediate representations achieves performance similar to that of using high-level features (mel-cepstra and F0). Third, the feasibility of applying WaveNet to wideband speech signals with more than 256 quantization levels is verified by experiments. Finally, a system that adopts STFT amplitude spectra as intermediate representations to model 24 kHz speech waveforms with 1024 mu-law quantization levels is submitted for evaluation. The evaluation results of BMLC2017 demonstrate the effectiveness of our proposed methods.
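To make the "1024 mu-law quantization levels" concrete, the following is a minimal sketch of textbook mu-law companding applied to a single sample in [-1, 1]. It is not taken from the paper: the function names `mu_law_encode` and `mu_law_decode` are hypothetical, and the rounding convention and the choice mu = levels - 1 are assumptions that follow common practice in WaveNet-style systems.

```python
import math


def mu_law_encode(x, levels=1024):
    """Map a sample in [-1, 1] to an integer class in [0, levels - 1].

    Assumption: mu = levels - 1, as is common for WaveNet-style models.
    """
    mu = levels - 1
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    # Compress: sign(x) * ln(1 + mu * |x|) / ln(1 + mu), result in [-1, 1]
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift to [0, mu] and round to the nearest integer class
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(index, levels=1024):
    """Invert mu_law_encode up to quantization error."""
    mu = levels - 1
    y = 2.0 * index / mu - 1.0  # back to [-1, 1]
    # Expand: sign(y) * ((1 + mu) ** |y| - 1) / mu
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

With this convention, a model using 1024 levels predicts a 1024-way categorical distribution per sample instead of the 256-way distribution of 8-bit mu-law, which gives a finer amplitude resolution at the cost of a larger output layer.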