CRCTTS: Convolution-Recurrent-Convolution Text-to-Speech System.

Kuo Chen, Xuebin Sun
DOI: https://doi.org/10.1145/3548608.3559304
2022-01-01
Abstract: End-to-end speech synthesis has already replaced Statistical Parametric Speech Synthesis (SPSS) in the text-to-speech (TTS) field. End-to-end models based on neural networks require little domain knowledge yet synthesize more natural speech. Tacotron was the first model to synthesize speech that even humans find hard to distinguish from real speech. We propose a new end-to-end speech synthesis system called Convolution-Recurrent-Convolution Text-to-Speech (CRCTTS). We choose Tacotron as our baseline model and modify its architecture with a fully Convolutional Neural Network (CNN) module and Dynamic Convolution Attention (DCA). In addition, we introduce a guided attention mechanism to accelerate attention alignment in the decoder module. With these techniques, the proposed model is shown to synthesize speech of better quality and to take less time in both the training and synthesis stages than the baseline model.
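The abstract does not spell out how the guided attention mechanism is formulated. As a minimal sketch only, the following assumes one commonly used formulation, a guided attention loss that penalizes attention weights lying far from the diagonal of the text-by-frame alignment matrix. The function names, the width parameter g = 0.2, and the toy alignments are illustrative assumptions, not details taken from the paper.

```python
import math


def guided_attention_weights(num_text, num_frames, g=0.2):
    """Build the penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    Assumed formulation: entries near the diagonal get a penalty near 0,
    entries far from it get a penalty approaching 1.
    """
    weights = []
    for n in range(num_text):
        row = []
        for t in range(num_frames):
            offset = n / num_text - t / num_frames
            row.append(1.0 - math.exp(-(offset ** 2) / (2.0 * g ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over the whole alignment matrix."""
    num_text = len(attention)
    num_frames = len(attention[0])
    weights = guided_attention_weights(num_text, num_frames, g)
    total = sum(
        attention[n][t] * weights[n][t]
        for n in range(num_text)
        for t in range(num_frames)
    )
    return total / (num_text * num_frames)


# Toy alignments (hypothetical data): a perfectly diagonal one and a reversed one.
size = 4
diagonal = [[1.0 if n == t else 0.0 for t in range(size)] for n in range(size)]
reversed_alignment = [
    [1.0 if n == size - 1 - t else 0.0 for t in range(size)] for n in range(size)
]

diagonal_loss = guided_attention_loss(diagonal)
reversed_loss = guided_attention_loss(reversed_alignment)
print(diagonal_loss < reversed_loss)
```

Adding a term like this to the training objective pushes the attention toward a monotonic, near-diagonal alignment early in training, which is the kind of faster alignment the abstract attributes to the guided mechanism.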