Building Mongolian TTS Front-End With Encoder-Decoder Model By Using Bridge Method And Multi-View Features

Rui Liu, Feilong Bao, Guanglai Gao
DOI: https://doi.org/10.1007/978-3-030-36802-9_68
2019-01-01
Abstract: In text-to-speech (TTS) systems, the front-end is a critical step for extracting linguistic features from the input text. In this paper, we propose a Mongolian TTS front-end that jointly trains grapheme-to-phoneme conversion (G2P) and phrase break prediction (PB). We use a bidirectional long short-term memory (LSTM) network as the encoder and build two decoders, one for G2P and one for PB, that share this encoder. We also feed the source input features together with the encoder hidden states into the decoders, aiming to shorten the distance between the source and target sequences and to learn the alignment information better. More importantly, to obtain a robust representation for Mongolian words, which are agglutinative in nature and lack sufficient training data, we design specific multi-view input features for them. Our subjective and objective experiments demonstrate the effectiveness of this proposal.
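The bridge idea described in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: the function names are hypothetical, the "bridge" is modeled as a per-step concatenation of input features and encoder states, and a toy running-mean encoder stands in for the bidirectional LSTM purely so that the example runs without a deep learning library.

```python
from typing import List

Vector = List[float]


def toy_bidirectional_encoder(inputs: List[Vector]) -> List[Vector]:
    """Stand-in for a BiLSTM (assumption): running means computed
    left-to-right and right-to-left, concatenated at each time step."""
    if not inputs:
        return []

    def running_mean(seq: List[Vector]) -> List[Vector]:
        out: List[Vector] = []
        total = [0.0] * len(seq[0])
        for step, vec in enumerate(seq, start=1):
            total = [t + v for t, v in zip(total, vec)]
            out.append([t / step for t in total])
        return out

    forward = running_mean(inputs)
    # Run over the reversed sequence, then restore the original order.
    backward = running_mean(inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def bridge(inputs: List[Vector], encoder_states: List[Vector]) -> List[Vector]:
    """Hypothetical bridge: concatenate each source input feature vector
    with the encoder hidden state at the same position."""
    if len(inputs) != len(encoder_states):
        raise ValueError("inputs and encoder states must have equal length")
    return [x + h for x, h in zip(inputs, encoder_states)]


def build_shared_memory(inputs: List[Vector]) -> List[Vector]:
    """Memory that both decoders (G2P and PB) would read from."""
    return bridge(inputs, toy_bidirectional_encoder(inputs))


# Three time steps with 2-dimensional input features.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory = build_shared_memory(inputs)
# Each step holds 2 input dims plus 4 encoder dims (2 forward, 2 backward).
print(len(memory), len(memory[0]))
```

In this sketch both decoders would consume the same `memory`, which is one way to read "two decoders that share the same encoder"; a real system would replace the toy encoder with a trained bidirectional LSTM and add attention in each decoder.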