PE-wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS

Zhao-Ci Liu, Liping Chen, Ya-Jun Hu, Zhen-Hua Ling, Jia Pan
DOI: https://doi.org/10.1109/taslp.2024.3449148
2024-01-01
Abstract: This paper investigates leveraging large-scale untranscribed speech data to enhance the prosody modelling capability of text-to-speech (TTS) models. On the basis of the self-supervised speech model wav2vec 2.0, Prosody-Enhanced wav2vec (PE-wav2vec) is proposed by introducing prosody learning. Specifically, prosody learning is achieved by applying supervision from the linear predictive coding (LPC) residual signals on the initial Transformer blocks in the wav2vec 2.0 architecture. The embedding vectors extracted with the initial Transformer blocks of the PE-wav2vec model are utilised as prosodic representations for the corresponding frames in a speech utterance. To apply the PE-wav2vec representations in TTS, an acoustic model named Speech Synthesis model conditioned on Self-Supervisedly Learned Prosodic Representations (S4LPR) is designed on the basis of FastSpeech 2. The experimental results demonstrate that the proposed PE-wav2vec model can provide richer prosody descriptions of speech than the vanilla wav2vec 2.0 model can. Furthermore, the S4LPR model using PE-wav2vec representations can effectively improve the subjective naturalness and reduce the objective distortions of synthetic speech compared with baseline models.
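For readers unfamiliar with the LPC residual mentioned in the abstract, the following is a minimal sketch of how such a signal can be obtained from one speech frame with the standard autocorrelation method. It is an illustration only, not the authors' pipeline: the abstract does not state the LPC order, the framing, or how the residual enters the training loss, so the function names `lpc_coefficients` and `lpc_residual` and the default order of 16 are assumptions made here.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Estimate LPC coefficients a[0..order] (a[0] = 1) by the autocorrelation method."""
    frame = np.asarray(frame, dtype=np.float64)
    n = len(frame)
    # Autocorrelation of the frame at lags 0..order.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    # Small diagonal loading keeps the system solvable for silent frames (an assumption).
    R = R + (1e-9 * r[0] + 1e-12) * np.eye(order)
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))


def lpc_residual(frame, order=16):
    """Inverse-filter the frame: e[n] = sum_k a[k] * x[n - k]."""
    frame = np.asarray(frame, dtype=np.float64)
    a = lpc_coefficients(frame, order)
    # Full convolution truncated to the frame length gives the causal FIR output.
    return np.convolve(frame, a)[: len(frame)]
```

The residual is what remains after the short-term spectral envelope has been removed by the inverse filter, so it mainly carries excitation information such as pitch periodicity and energy. That is one plausible reading of why it could serve as a prosody-related supervision target, although the abstract itself does not spell out the motivation.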