A Novel Method for Mandarin Speech Synthesis by Inserting Prosodic Structure Prediction into Tacotron2.

Liu Junmin, Beijing Sankuai Online Technology Company Ltd., Zhang Chunxia, Shi Guang
DOI: https://doi.org/10.1007/s13042-021-01365-x
2021-01-01
International Journal of Machine Learning and Cybernetics
Abstract: Speech synthesis, an artificial intelligence technology that uses computers to imitate human speech, plays a crucial role in human–computer interaction because it can automatically convert text into speech with satisfactory intelligibility and naturalness. Tacotron2 is the second-generation end-to-end English speech synthesis model developed by Google. As Mandarin becomes increasingly popular worldwide, Mandarin speech synthesis technologies have been deployed in a wide range of applications. To extend Tacotron2 to Mandarin speech synthesis, we propose a novel synthesis method that adds a Mandarin-to-PinYin module and a prosodic structure prediction model to Tacotron2. Subjective and objective evaluations of the synthesized results demonstrate that the added prosodic structure prediction model helps Tacotron2 synthesize more natural and human-like Mandarin speech.
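As a rough illustration of the kind of text front end the abstract describes, the sketch below converts Mandarin characters to tone-numbered PinYin and interleaves prosodic boundary tags before the sequence would be handed to an acoustic model such as Tacotron2. The toy lexicon, the `#1`/`#3` tag names, and all function names are assumptions made for illustration, and the boundary positions are supplied by hand rather than predicted; the abstract does not specify how the paper's modules are implemented.

```python
# Toy lexicon (hypothetical); a real system would use a full pronunciation dictionary.
TOY_LEXICON = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}


def to_pinyin(text):
    """Map each character to tone-numbered PinYin; unknown characters pass through unchanged."""
    return [TOY_LEXICON.get(ch, ch) for ch in text]


def insert_prosody(syllables, boundaries):
    """Insert a boundary tag after the syllable at each index listed in `boundaries`.

    `boundaries` maps syllable index -> tag (e.g. "#1"). Here it stands in for
    the output of a prosodic structure prediction model (an assumption).
    """
    out = []
    for i, syllable in enumerate(syllables):
        out.append(syllable)
        if i in boundaries:
            out.append(boundaries[i])
    return out


def build_input(text, boundaries):
    """Produce the space-separated symbol sequence an acoustic model could consume."""
    return " ".join(insert_prosody(to_pinyin(text), boundaries))


print(build_input("你好世界", {1: "#1", 3: "#3"}))
```

In this sketch the boundary tags are ordinary symbols in the input sequence, so an encoder can learn embeddings for them alongside the PinYin syllables; that is one plausible way to feed prosodic structure into a sequence-to-sequence model, not necessarily the one used in the paper.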