SJTU Entry in Blizzard Challenge 2019

Bo Chen, Kuan Chen, Zhijun Liu, Zhihang Xu, Songze Wu, Chenpeng Du, Muyang Li, Sijun Li, Kai Yu
DOI: https://doi.org/10.21437/blizzard.2019-14
2019-01-01
Abstract: This paper presents the techniques used in the sjtu-tts entry to the Blizzard Challenge 2019. The main architecture is Tacotron with a WaveNet vocoder. The BC2019 corpus consists of 8 hours of audio from a male Chinese speaker, with mixed Mandarin and English speech. The audio and transcriptions were collected from the Internet and are heavily corrupted and noisy. To deal with this corpus, our system is divided into four parts: data preprocessing, a spectrogram model, a WaveNet vocoder, and speech bandwidth extension. The WaveNet vocoder is more closely related to speech quality, while the spectrogram model is more closely related to prosody (pitch and duration). We did not succeed in training a good WaveNet vocoder for the predicted mel-spectrogram. As a result, some useful techniques in the other parts yielded no significant improvement after WaveNet vocoding. These attempts, which were not included in the final submission, are also analyzed.