Submission from SCUT for Blizzard Challenge 2020

Yitao Yang, Jinghui Zhong, Shehui Bu
DOI: https://doi.org/10.21437/vcc_bc.2020-6
2020-01-01
Abstract: In this paper, we describe the SCUT text-to-speech synthesis system for the Blizzard Challenge 2020, whose task is to build a voice from the provided Mandarin dataset. We begin with our system architecture, which is composed of an end-to-end model that converts textual sequences into acoustic features and a WaveRNN vocoder that restores the waveform. We then introduce a BERT-based prosody prediction model that specifies the prosodic information of the content. The text processing module is adjusted to uniformly encode both Mandarin and English texts, and a two-stage training method is used to build a bilingual speech synthesis system. Meanwhile, we employ forward attention and guided attention mechanisms to accelerate the model’s convergence. Finally, we discuss the reasons for our system's unsatisfactory performance in the evaluation results.