StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng
2023-12-19
Abstract: The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner, with plenty of audio data that covers complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model the pronunciation and high-level style expressiveness respectively, with the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis, especially for the role and out-of-domain scenarios.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limited expressiveness of synthesized speech for audiobooks. Existing Text-to-Speech (TTS) systems can generate high-quality, neutral-style speech, but a significant gap in expressiveness remains compared with real human speech. The gap is most pronounced for long-form, expressively rich material such as audiobooks, where the wide range of vocal styles in the data tends to collapse into an averaged prosodic style.

To address this, the paper proposes a self-supervised style enhancement method that uses pre-training based on a Vector-Quantized Variational AutoEncoder (VQ-VAE). The method has three parts:

1. **Pre-training of the Text Style Encoder**: The text style encoder is pre-trained on a large amount of easily accessible, unlabeled text-only data.
2. **Pre-training of the Spectrogram Style Extractor**: A spectrogram style extractor is built on VQ-VAE and pre-trained in a self-supervised manner on a large amount of audio data that covers complex style variations from other domains.
3. **Specially Designed TTS Architecture**: A novel TTS architecture with two encoder-decoder paths is designed, where one path models pronunciation and the other models high-level style expressiveness under the guidance of the style extractor. This design helps enhance the expressiveness of synthesized speech in complex scenarios.

Both objective and subjective evaluations show that the proposed method improves the naturalness and expressiveness of the synthesized speech, especially in role (character) and out-of-domain scenarios.
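As a rough illustration of the vector-quantization step that a VQ-VAE-based style extractor relies on, the sketch below maps continuous latent vectors to their nearest entries in a codebook. It is a generic NumPy example, not the paper's implementation: the function name `vector_quantize`, the toy codebook, and the 2-dimensional latents are all assumptions made for illustration, and a real model would learn the codebook jointly with an encoder and decoder.

```python
import numpy as np


def vector_quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry (squared L2).

    latents:  (n, d) array of continuous encoder outputs (hypothetical here)
    codebook: (k, d) array of code vectors (learned in a real VQ-VAE)
    Returns (indices, quantized) with shapes (n,) and (n, d).
    """
    # Pairwise differences via broadcasting: shape (n, k, d)
    diffs = latents[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every latent to every code: shape (n, k)
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each latent
    indices = np.argmin(dists, axis=1)
    # Replace each latent with its selected code vector
    quantized = codebook[indices]
    return indices, quantized


# Toy data: three codes and three latents, each lying near a different code.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])

indices, quantized = vector_quantize(latents, codebook)
print(indices)    # discrete code index chosen for each latent
print(quantized)  # the corresponding code vectors
```

The discrete indices are what make the representation quantized: each latent is summarized by one of a finite set of codes. How the paper's extractor uses such codes to guide the two encoder-decoder paths is not shown here.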