Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer

Yuchen Huang, Zhiyong Wu, Runnan Li, Helen Meng, Lianhong Cai
DOI: https://doi.org/10.21437/interspeech.2017-949
2017-01-01
Abstract: Prosodic structure generation from text plays an important role in Chinese text-to-speech (TTS) synthesis, as it greatly influences the naturalness and intelligibility of the synthesized speech. This paper proposes a multi-task learning method for prosodic structure generation using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) with a structured output layer (SOL). Unlike traditional methods, which usually require prerequisites such as lexicon words or even syntactic trees as input, the proposed method predicts prosodic boundary labels directly from Chinese characters. The BLSTM RNN captures the bidirectional contextual dependencies of prosodic boundary labels. The SOL further models correlations among prosodic structure, lexicon words, and part-of-speech (POS) tags by conditioning the prediction of prosodic boundary labels on the word tokenization and POS tagging results. Experimental results demonstrate the effectiveness of the proposed method.
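The conditioning performed by the structured output layer can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class name `StructuredOutputLayer`, the function `multitask_loss`, the label-set sizes, the random initialization, and the equal task weights are all assumptions made for illustration. The BLSTM itself is omitted, and its per-character hidden state is taken as a given vector `h`. The sketch further assumes that the word-tokenization and POS heads each read `h` alone, while the prosodic boundary head reads `h` concatenated with both auxiliary output distributions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class StructuredOutputLayer:
    """Hypothetical SOL: prosody prediction sees the auxiliary task outputs."""

    def __init__(self, hidden_size, n_seg, n_pos, n_prosody, seed=0):
        rng = random.Random(seed)
        # Auxiliary heads read the (assumed) BLSTM hidden state only.
        self.seg = init_layer(rng, n_seg, hidden_size)
        self.pos = init_layer(rng, n_pos, hidden_size)
        # The main head reads the hidden state plus both auxiliary outputs.
        self.prosody = init_layer(rng, n_prosody, hidden_size + n_seg + n_pos)

    def forward(self, h):
        seg = softmax(linear(*self.seg, h))
        pos = softmax(linear(*self.pos, h))
        prosody = softmax(linear(*self.prosody, h + seg + pos))
        return seg, pos, prosody


def multitask_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropies for one character."""
    return sum(w * -math.log(dist[t])
               for w, dist, t in zip(weights, outputs, targets))


if __name__ == "__main__":
    layer = StructuredOutputLayer(hidden_size=8, n_seg=4, n_pos=5, n_prosody=4)
    h = [0.1 * i for i in range(8)]  # stand-in for one BLSTM output vector
    outputs = layer.forward(h)
    print(multitask_loss(outputs, targets=(0, 1, 2)))
```

In a trained system the three heads would be optimized jointly through the summed loss, so that errors on the auxiliary tasks also shape the shared character-level representation; the weighting between tasks is left here as a free parameter.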