Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Ya-Jie Zhang, Chao Zhang, Wei Song, Zhengchen Zhang, Youzheng Wu, Xiaodong He
DOI: https://doi.org/10.1109/taslp.2023.3278184
2023-01-01
Abstract: When humans speak multiple utterances in a continuous manner, the prosodic features generated in each utterance are related to those in its neighbouring utterances. Such cross-utterance (CU) dependencies are often ignored by current neural text-to-speech (TTS) systems, which reduces the naturalness and expressiveness of the synthesized speech. In this paper, we propose to improve the prosody modelling ability of neural TTS systems using pre-trained CU acoustic and text representations. The CU acoustic representations are derived using the Wav2Vec 2.0 (W2V2) model from the synthesized audio of past utterances, while the CU text representations are extracted using the Bidirectional Encoder Representations from Transformers (BERT) model from the scripts of future utterances. Experimental results on a Mandarin audiobook and an English audiobook showed that the naturalness and expressiveness of the synthesized audio were significantly improved by incorporating such pre-trained W2V2 and BERT CU representations into the FastSpeech 2 TTS framework.
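The abstract does not say how the CU representations are combined with the TTS model, so the following is only a minimal sketch under stated assumptions: per-utterance embeddings (stand-ins for W2V2 outputs on past audio and BERT outputs on future scripts) are mean-pooled, concatenated into one conditioning vector, and appended to every phoneme-level encoder state. The function names `mean_pool` and `condition_encoder_states`, the pooling choice, and the concatenation-based fusion are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors into one vector."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def condition_encoder_states(
    encoder_states: List[Vector],
    past_acoustic: List[Vector],
    future_text: List[Vector],
) -> List[Vector]:
    """Append one pooled cross-utterance vector to every encoder state.

    Assumption: past_acoustic holds one embedding per past utterance
    (e.g. pooled W2V2 features) and future_text holds one embedding per
    future utterance (e.g. pooled BERT features). The fusion used here,
    simple concatenation, is illustrative only.
    """
    # List concatenation: [acoustic summary] followed by [text summary].
    cu_vector = mean_pool(past_acoustic) + mean_pool(future_text)
    return [state + cu_vector for state in encoder_states]


# Toy inputs: two phoneme states, two past utterances, two future utterances.
encoder_states = [[0.1, 0.2], [0.3, 0.4]]
past_acoustic = [[1.0, 3.0], [3.0, 5.0]]
future_text = [[0.5], [1.5]]

conditioned = condition_encoder_states(encoder_states, past_acoustic, future_text)
```

Each conditioned state here has dimension 2 + 2 + 1 = 5. In a real system the pooled vectors would more likely pass through learned projection layers before being fused, and the prosody predictors (pitch, energy, duration) would consume the conditioned states.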
engineering, electrical & electronic; acoustics