End-to-End Mongolian Text-to-Speech System

Jingdong Li, Hui Zhang, Rui Liu, Xueliang Zhang, Feilong Bao
DOI: https://doi.org/10.1109/iscslp.2018.8706263
2018-01-01
Abstract: Speech synthesis, or text-to-speech (TTS), generates a speech waveform from given text. Building a satisfactory TTS system requires a large corpus of natural speech. In the traditional approach, the corpus must be accompanied by precise annotations, which are difficult and costly to produce. Recently proposed end-to-end speech synthesis methods eliminate the need for annotation, making TTS systems easier and less costly to develop. We apply the state-of-the-art end-to-end Tacotron model to the Mongolian TTS task. Trained on a much larger amount of unannotated speech data (about 17 hours), the new system outperforms the previous best Mongolian TTS system, which was trained on a small amount of annotated data (about 5 hours), by a large margin: its mean opinion score (MOS) is 3.65, compared with 2.08 for the previous system. The proposed system is the first Mongolian TTS system that can be used in real applications.