DOP-Tacotron: a Fast Chinese TTS System with Local-based Attention

Ting He, Wei Zhao, Li Xu
DOI: https://doi.org/10.1109/ccdc49329.2020.9164203
2020-01-01
Abstract: Text-to-speech (TTS) systems, as used in human-robot interaction, generate human-like speech from written text to mimic human speakers. End-to-end TTS systems have been widely explored in recent years. In this paper, we propose DOP-Tacotron, a fast-to-train end-to-end Chinese TTS system. We introduce the DOP module for the encoder and the post-processing network; DOP achieves nearly the same effect as the CBHG module while using 35.5% fewer parameters. We use a local-based attention mechanism, which always follows the previous attention state. DOP-Tacotron achieves a subjective 5-scale mean opinion score (MOS) of 3.683 for naturalness on Mandarin Chinese, outperforming Tacotron. In addition, DOP-Tacotron adds a stop-talk-loss term to the spectrogram loss, uses sample-length-batch to form mini-batches, and takes accurate Chinese pinyin with punctuation as input. The proposed system is easy to train: training DOP-Tacotron takes only 2.5 hours.