StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng

2023-12-19

Abstract: The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlabeled text-only data. Secondly, a spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised manner, with plenty of audio data that covers complex style variations. Then a novel architecture with two encoder-decoder paths is specially designed to model the pronunciation and high-level style expressiveness respectively, with the guidance of the style extractor. Both objective and subjective evaluations demonstrate that our proposed method can effectively improve the naturalness and expressiveness of the synthesized speech in audiobook synthesis, especially for the role and out-of-domain scenarios.

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the limited expressiveness of speech synthesized for audiobooks. Existing Text-to-Speech (TTS) systems can generate high-quality, neutral-style speech, but a significant gap in expressiveness remains compared with real human speech. The gap is particularly pronounced for long, expressively rich data such as audiobooks, where the rich vocal variation often degrades into an averaged prosodic style.

To address this, the paper proposes a self-supervised style enhancement method that uses pre-training based on the Vector-Quantized Variational AutoEncoder (VQ-VAE) to improve the expressiveness of audiobook speech synthesis. The method has three parts:

1. **Pre-training of the Text Style Encoder**: The text style encoder is pre-trained on a large amount of easily accessible, unannotated text-only data.
2. **Pre-training of the Spectrogram Style Extractor**: A spectrogram style extractor built on VQ-VAE is pre-trained in a self-supervised manner on a large amount of audio data from other domains that covers complex style variations.
3. **Specially Designed TTS Architecture**: A novel TTS architecture with two encoder-decoder paths is designed, guided by the style extractor. One path models pronunciation and the other models high-level style expressiveness, which helps enhance the expressiveness of synthesized speech in complex scenarios.

Both subjective and objective evaluations show that the proposed method improves the naturalness and expressiveness of synthesized speech, especially in role (character) and out-of-domain scenarios.
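To make the VQ-VAE idea behind the spectrogram style extractor more concrete, the following is a minimal sketch of the generic vector-quantization step: each continuous style vector is replaced by its nearest entry in a learned codebook. It is not the authors' implementation. The function names, the toy 2-dimensional codebook, and the example frames are all hypothetical, and the summary above does not state the codebook size, the embedding dimension, or the training losses used in the paper.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame-level vector to its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Nearest-neighbour lookup over all codebook entries.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


def quantization_error(
    frames: Sequence[Sequence[float]],
    quantized: Sequence[Sequence[float]],
) -> float:
    """Mean squared distance between the inputs and their quantized versions."""
    total = sum(squared_distance(f, q) for f, q in zip(frames, quantized))
    return total / len(frames)


# Hypothetical toy data: three "style" codes and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

indices, quantized = quantize(frames, codebook)
print(indices)
print(quantization_error(frames, quantized))
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder, and the discrete indices act as a compact style representation. This sketch covers only the lookup and the resulting quantization error, and leaves out gradient handling such as the straight-through estimator.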
