A Transformer-based Chinese Non-autoregressive Speech Synthesis Scheme

Yueqing Cai, Wenbi Rao
DOI: https://doi.org/10.1145/3459212.3459222
2021-03-19
Abstract: Research on speech synthesis still focuses mainly on English, and few non-autoregressive models exist for Chinese. When adapting FastSpeech2 to Chinese, we found that the synthesized audio was not natural enough and contained abnormal interruptions and mispronunciations. Inspired by the training method of generative adversarial networks, we use FastSpeech2 as the generator and add a discriminator that forces FastSpeech2 to generate audio more similar to real audio. To provide a complete text-to-Mel-spectrogram speech synthesis scheme, we design a text-to-phoneme converter based on a corpus and rule constraints. We conduct experiments on the Baker dataset. The results show that our model achieves a better Mel Cepstral Distance than FastSpeech2 and a mean opinion score of 3.94, which is slightly better than that of the original model.
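The abstract does not say how the Mel Cepstral Distance is computed. As a rough illustration only, the sketch below assumes the commonly used mel cepstral distortion formula, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name `mel_cepstral_distance`, the input layout (time-aligned lists of mel-cepstral coefficient frames, with the energy coefficient already dropped), and the averaging are assumptions made for this example, not details taken from the paper.

```python
import math


def mel_cepstral_distance(ref_frames, syn_frames):
    """Mean mel cepstral distortion in dB over time-aligned frames.

    Assumption: each frame is a sequence of mel-cepstral coefficients,
    the two utterances are already aligned frame by frame, and the
    energy (0th) coefficient has been removed by the caller.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")

    # Constant factor of the standard formula: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)

    # Average the per-frame distortion over the utterance.
    return total / len(ref_frames)
```

Lower values indicate that the synthesized cepstra are closer to the reference, which is the sense in which one model can be "better" than another on this metric. In practice the two utterances usually differ in length, so an alignment step such as dynamic time warping would be needed before calling a function like this one.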