Abstract: With the growing demand for autonomous control and personalized speech generation, style control and transfer in Text-to-Speech (TTS) are becoming increasingly important. In this paper, we propose a new TTS system that performs style transfer with interpretability and high fidelity. First, we design a TTS system that combines a variational autoencoder (VAE) with a diffusion refiner to obtain refined mel-spectrograms. Specifically, we design both a two-stage and a one-stage system to improve audio quality and style transfer performance. Second, we design a diffusion bridge over the quantized VAE to efficiently learn complex discrete style representations and improve style transfer performance. To further strengthen style transfer, we introduce ControlVAE, which improves reconstruction quality while retaining good interpretability. Experiments on the LibriTTS dataset demonstrate that our method is more effective than the baseline models.

What problem does this paper attempt to address?

This paper aims to solve the style transfer and interpretability problems in Text-to-Speech (TTS) systems. Specifically, the authors propose a new TTS system, IST-TTS (Interpretable Style Transfer for Text-to-Speech), which combines a Variational Autoencoder (VAE) with Diffusion Probabilistic Models (DPM) to improve audio quality and style transfer performance. In addition, to achieve better style transfer quality and good style disentanglement, the paper introduces ControlVAE and designs a Diffusion Bridge to enhance the diversity of the generated style representations.

### Main Contribution Points:

1. **Propose a new TTS system**: The system integrates VAE and DPM and improves style transfer performance through two-stage and single-stage training processes.
2. **Design a Diffusion Bridge**: The output of the VAE is quantized, and the Diffusion Bridge models the diversity of the resulting style representations in the latent space, which improves style transfer.
3. **Introduce ControlVAE**: ControlVAE replaces the traditional VAE to obtain better reconstruction ability and style interpretability.

### Experimental Results:

- **Objective Evaluation**: With Fréchet Distance (FD) and Mel-Cepstral Distortion (MCD) as evaluation metrics, the experimental results show that IST-TTS outperforms the baseline model in the quality and similarity of style transfer.
- **Subjective Evaluation**: In 5-level Mean Opinion Score (MOS) and Similarity Mean Opinion Score (SMOS) tests, IST-TTS shows higher quality and similarity in both parallel and non-parallel style transfer tasks.
- **Ablation Study**: Removing Vector Quantization (VQ), removing the Diffusion Bridge, or using the original VAE instead of ControlVAE each leads to performance degradation, which supports the effectiveness of these components.

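
To make the MCD metric named in the objective evaluation concrete, the following is a minimal sketch of how mean Mel-Cepstral Distortion is commonly computed. It is not taken from the paper: the function name `mel_cepstral_distortion` is hypothetical, the frames are assumed to be time-aligned already (for example by dynamic time warping), and the 0th (energy) coefficient is assumed to have been dropped. The paper's exact MCD configuration is not stated in this summary.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Return the mean MCD in dB over pre-aligned frames.

    Assumptions (not from the paper): each frame is a sequence of
    mel-cepstral coefficients without the 0th coefficient, and the two
    sequences already have the same number of frames.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")

    # Common MCD constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance between the two coefficient vectors
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_diff)

    # Average the per-frame distortion over all frames
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5], [0.2, 0.1]]
    synthesized = [[0.9, 0.4], [0.2, 0.3]]
    print(mel_cepstral_distortion(reference, synthesized))
```

A lower value means the synthesized spectral envelope is closer to the reference; identical inputs give 0.
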
### Conclusion:

The IST-TTS method proposed in this paper improves the style transfer quality and interpretability of TTS systems relative to the baseline models evaluated. Future work will further explore methods to improve style interpretability.
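
As background on the ControlVAE component listed among the contributions: the general ControlVAE idea is to replace a fixed KL weight with a feedback controller that steers the KL divergence toward a chosen set point during training. The sketch below is a simplified PI-style controller written under that assumption. The class name `PIBetaController`, the gains, the set point, and the anti-windup rule are all illustrative choices, not values or code from the paper.

```python
import math


class PIBetaController:
    """Simplified PI-style controller for the KL weight (beta) of a VAE.

    Hypothetical sketch: gains, bounds, and the anti-windup rule are
    illustrative assumptions, not the paper's settings.
    """

    def __init__(self, target_kl, kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0):
        self.target_kl = target_kl
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0
        self.beta = beta_min

    def update(self, observed_kl):
        """Return the KL weight to use for the next training step."""
        error = self.target_kl - observed_kl
        # Proportional term, bounded in (0, kp); exponent capped to avoid overflow
        p_term = self.kp / (1.0 + math.exp(min(error, 50.0)))
        # Integral term grows while the observed KL exceeds the target
        candidate_integral = self.integral - self.ki * error
        raw_beta = p_term + candidate_integral + self.beta_min
        if self.beta_min <= raw_beta <= self.beta_max:
            # Keep the integral update only while beta is unsaturated
            self.integral = candidate_integral
            self.beta = raw_beta
        else:
            # Saturated: clamp beta and freeze the integral (anti-windup)
            self.beta = min(max(raw_beta, self.beta_min), self.beta_max)
        return self.beta


if __name__ == "__main__":
    controller = PIBetaController(target_kl=5.0)
    for kl in (20.0, 18.0, 12.0, 6.0):
        print(controller.update(kl))
```

In a training loop, the returned weight would multiply the KL term of the VAE loss at each step, so the penalty rises when the KL divergence is above the set point and relaxes when it falls below.
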

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Innovative Speaker-Adaptive Style Transfer VAE-WadaIN for Enhanced Voice Conversion in Intelligent Speech Processing

PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis

Disentangling Style and Speaker Attributes for TTS Style Transfer

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Exploring synthetic data for cross-speaker style transfer in style representation based TTS

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

Fine-grained Text Style Transfer with Diffusion-Based Language Models