Abstract: This paper presents the I2R-NWPU-NUS team’s text-to-speech system for Blizzard Challenge 2018. Instead of the unit-selection-based concatenative speech synthesis used in previous years, we adopt a deep neural network (DNN) based statistical parametric method to synthesize speech. Frame-level acoustic parameters and phone durations are modelled using bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs). For the duration model, five states per phone are used to predict the duration of each phoneme. Finally, the predicted acoustic parameters (MGC, LF0, V/UV) are passed to the WORLD vocoder to generate the synthetic speech. Listening tests show an improvement over the DNN baseline system.

The I2R-NWPU-NUS Text-to-Speech System for Blizzard Challenge 2018