Prosodic Structure Prediction Using Deep Self-attention Neural Network

Yao Du, Zhiyong Wu, Shiyin Kang, Dan Su, Dong Yu, Helen Meng
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023259
2019-01-01
Abstract: Prosodic structure prediction is a key part of the text analysis front-end of a text-to-speech (TTS) system. It predicts prosodic boundary tags from the input text context, which is essential to the naturalness of synthesized speech. Conventional methods such as conditional random fields (CRFs) and recurrent neural networks (RNNs) have been successfully applied to this task. However, these methods do not model temporal dependencies at different scopes (both the short-term dependency and the long-span dependency across the entire sentence), which limits their performance. In this paper, we propose a self-attention network that uses semantic features extracted by a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to predict the prosodic structure. Experimental results show that the proposed approach outperforms a strong CRF baseline with an absolute improvement of 3.4% in total accuracy.
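To illustrate how self-attention can relate every token to every other token in a sentence (short-term and long-span dependencies alike), here is a minimal single-head sketch in plain Python. It is not the authors' architecture, which the abstract does not specify: the tag names in `TAGS`, the toy 2-dimensional features standing in for BERT embeddings, and all weight matrices are assumptions chosen only for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def self_attention(feats, wq, wk, wv):
    """Single-head scaled dot-product self-attention over one sentence.

    feats holds one feature vector per token. Returns the contextualized
    vectors and the attention weights (one row per token).
    """
    q = [matvec(wq, x) for x in feats]
    k = [matvec(wk, x) for x in feats]
    v = [matvec(wv, x) for x in feats]
    scale = math.sqrt(len(q[0]))
    out, weights = [], []
    for qi in q:
        # Every token scores every position, near or far, in one step.
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        attn = softmax(scores)
        weights.append(attn)
        out.append([sum(a * vj[d] for a, vj in zip(attn, v))
                    for d in range(len(v[0]))])
    return out, weights

def predict_tags(feats, wq, wk, wv, w_out, tags):
    """Pick the highest-scoring boundary tag for each token."""
    ctx, _ = self_attention(feats, wq, wk, wv)
    predictions = []
    for c in ctx:
        logits = matvec(w_out, c)
        predictions.append(tags[logits.index(max(logits))])
    return predictions

# Hypothetical tag set and toy values (assumptions, not from the paper).
TAGS = ["NB", "PW", "PPH", "IPH"]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for BERT features
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

ctx, weights = self_attention(feats, IDENTITY, IDENTITY, IDENTITY)
predicted = predict_tags(feats, IDENTITY, IDENTITY, IDENTITY, w_out, TAGS)
```

Each row of `weights` is a probability distribution over all tokens in the sentence, which is what lets a single layer combine local and sentence-wide context. A practical model would add multiple heads, stacked layers, and trained parameters.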