EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao
DOI: https://doi.org/10.1109/taslp.2024.3369528
2024-01-01
Abstract: Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage, high-quality, end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the external aligners, invertible structures, and complex training procedures that most previous TTS works rely on. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve speech quality better than, or at least comparable to, that of baseline models, while also providing faster inference and smaller model sizes.
engineering, electrical & electronic; acoustics