Quasi-Fully Convolutional Neural Network With Variational Inference For Speech Synthesis

Mu Wang, Xixin Wu, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Guangzhi Li, Dan Su, Dong Yu, Helen Meng
DOI: https://doi.org/10.1109/icassp.2019.8682528
2019-01-01
Abstract: Recurrent neural networks, such as gated recurrent units (GRUs) and long short-term memory (LSTM) networks, are widely used for acoustic modeling in speech synthesis. However, their sequential generation process is poorly suited to today's massively parallel computing devices. We introduce a fully convolutional neural network (CNN) model for speech synthesis that runs efficiently on parallel processors. To improve the quality of the generated acoustic features, we strengthen the model with variational inference. We also use quasi-recurrent neural networks (QRNNs) to smooth the generated acoustic features. Finally, a high-quality parallel WaveNet model is used to generate audio samples. Our contributions are twofold. First, we show that CNNs with variational inference can generate highly natural speech on a par with end-to-end models; the use of QRNNs further improves synthesis quality by reducing trembling in the generated acoustic features while introducing very little run-time overhead. Second, we present several techniques to further speed up the sampling process of the parallel WaveNet model.
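To illustrate why a QRNN layer can smooth a feature trajectory, the sketch below implements the f-pooling recurrence from the general QRNN formulation, h_t = f_t * h_{t-1} + (1 - f_t) * z_t, for a single feature dimension. The abstract does not specify the paper's pooling variant or layer configuration, so this is an assumption for illustration only; the function name `f_pooling` and its inputs are hypothetical. In a real QRNN the candidates `z` and gates `f` would be produced in parallel by convolutions, which are omitted here.

```python
def f_pooling(z, f, h0=0.0):
    """Gated running average over one feature dimension (illustrative sketch).

    z:  candidate values per frame (assumed to come from a convolution)
    f:  forget gates per frame, each in [0, 1] (assumed likewise)
    h0: initial hidden state
    """
    h = h0
    out = []
    for z_t, f_t in zip(z, f):
        # Only this cheap element-wise update is sequential.
        h = f_t * h + (1.0 - f_t) * z_t
        out.append(h)
    return out


# A jittery trajectory: with f = 0.5 the frame-to-frame jumps shrink.
jittery = [1.0, -1.0, 1.0, -1.0]
smoothed = f_pooling(jittery, [0.5] * len(jittery))
```

A gate of 0 passes the input through unchanged, while gates closer to 1 weight the previous state more heavily and so damp rapid fluctuations.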