Enhancing Prosodic Features by Adopting Pre-trained Language Model in Bahasa Indonesia Speech Synthesis

Lixuan Zhao, Jian Yang, Qinglai Qin
DOI: https://doi.org/10.1145/3446132.3446196
2020-12-24
Abstract: Deep neural network text-to-speech (TTS) systems can produce high-quality audio. However, modern TTS systems usually need a sizable set of studio-quality text-audio pairs as input. Because research on Bahasa Indonesia is scarce, the available data are usually worse in terms of both quality and size. End-to-end (E2E) TTS systems trained on those corpora struggle to generate satisfactory speech; in particular, the prosodic features are not obvious. Therefore, we propose a method to enhance the prosodic features of synthesized speech based on the GST-Tacotron2 model and a pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers). BERT, learned from a large amount of unlabeled text data, contains rich linguistic information, which can help TTS systems produce more obvious prosodic features. The subjective evaluation of our experimental results shows that the proposed method can indeed enhance the rhythm of synthesized speech.