Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Wenhao Guan, Tao Li, Yishuang Li, Hukai Huang, Qingyang Hong, Lin Li
2023-07-11
Abstract: With the demand for autonomous control and personalized speech generation, the style control and transfer in Text-to-Speech (TTS) is becoming more and more important. In this paper, we propose a new TTS system that can perform style transfer with interpretability and high fidelity. Firstly, we design a TTS system that combines variational autoencoder (VAE) and diffusion refiner to get refined mel-spectrograms. Specifically, a two-stage and a one-stage system are designed respectively, to improve the audio quality and the performance of style transfer. Secondly, a diffusion bridge of quantized VAE is designed to efficiently learn complex discrete style representations and improve the performance of style transfer. To have a better ability of style transfer, we introduce ControlVAE to improve the reconstruction quality and have good interpretability simultaneously. Experiments on the LibriTTS dataset demonstrate that our method is more effective than baseline models.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses style transfer and interpretability in Text-to-Speech (TTS) systems. Specifically, the authors propose a new TTS system, IST-TTS (Interpretable Style Transfer for Text-to-Speech), which combines a Variational Autoencoder (VAE) with Diffusion Probabilistic Models (DPM) to improve audio quality and style transfer performance. To achieve better style transfer quality and style disentanglement, the paper also introduces ControlVAE and designs a Diffusion Bridge that enhances the diversity of the generated style representations.

### Main Contribution Points:

1. **Propose a new TTS system**: The system integrates VAE and DPM and improves style transfer performance through two-stage and single-stage training processes.
2. **Design a Diffusion Bridge**: The VAE output is quantized so that the bridge can model the diversity of style representations in the latent space, which improves style transfer.
3. **Introduce ControlVAE**: ControlVAE replaces the traditional VAE to obtain better reconstruction ability and style interpretability.

### Experimental Results:

- **Objective Evaluation**: Using Fréchet Distance (FD) and Mel-Cepstral Distortion (MCD) as evaluation metrics, the experiments show that IST-TTS outperforms the baseline model in the quality and similarity of style transfer.
- **Subjective Evaluation**: In 5-level Mean Opinion Score (MOS) and Similarity Mean Opinion Score (SMOS) tests, IST-TTS achieves higher quality and similarity in both parallel and non-parallel style transfer tasks.
- **Ablation Study**: Removing Vector Quantization (VQ), removing the Diffusion Bridge, or using the original VAE instead of ControlVAE each degrades performance, which supports the effectiveness of these components.
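To make the MCD metric named in the objective evaluation concrete, here is a minimal Python sketch of the commonly used definition: a scaled Euclidean distance between mel-cepstral coefficients, averaged over frames. The function name `mel_cepstral_distortion`, the decision to skip the 0th (energy) coefficient, and the requirement that both inputs are already time-aligned are assumptions for illustration; this summary does not state the exact configuration used in the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Each argument is a list of frames; each frame is a list of coefficients.
    Assumption: frames are already aligned (for example by DTW) and the
    0th coefficient (overall energy) is excluded, a common convention.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and time-aligned")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Squared difference over coefficients 1..D (index 0 is skipped).
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)
```

Lower values indicate that the synthesized spectral envelope is closer to the reference. In practice the coefficients would come from a feature extractor and an alignment step, both of which are outside this sketch.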
### Conclusion:

The IST-TTS method proposed in this paper improves the style transfer quality and interpretability of TTS systems relative to the baseline models. Future work will further explore methods to improve style interpretability.
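As background on the ControlVAE component, ControlVAE is generally described as replacing the fixed KL weight of a VAE with a weight that a PI-style feedback controller adjusts so that the KL divergence tracks a chosen set point. The Python sketch below illustrates that idea only. The class name `KLWeightController`, the gains `kp` and `ki`, the set point, and the clamping range are hypothetical values chosen for illustration, not settings taken from the paper.

```python
import math


class KLWeightController:
    """PI-style controller for the KL weight (beta), in the spirit of ControlVAE.

    All default values are illustrative assumptions, not the paper's settings.
    """

    def __init__(self, target_kl, kp=0.01, ki=0.0001, beta_min=0.0, beta_max=1.0):
        self.target_kl = target_kl
        self.kp = kp
        self.ki = ki
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0  # accumulated integral term

    def step(self, observed_kl):
        """Return the KL weight to use for the next training step."""
        error = self.target_kl - observed_kl
        # Proportional term: sigmoid-shaped, so it stays within (0, kp).
        # The error is clipped only to keep math.exp from overflowing.
        clipped = max(-50.0, min(50.0, error))
        p_term = self.kp / (1.0 + math.exp(clipped))
        # Integral term grows while the observed KL exceeds the target.
        candidate = self.integral - self.ki * error
        beta = p_term + candidate + self.beta_min
        # Simple anti-windup: only keep the integral update if beta is in range.
        if self.beta_min <= beta <= self.beta_max:
            self.integral = candidate
        return max(self.beta_min, min(self.beta_max, beta))
```

In a training loop, `step` would be called once per iteration with the current KL value, and the returned weight would multiply the KL term of the loss. When the observed KL stays above the target the weight rises, and when it falls below the target the weight drops.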