Lightweight Convolution-Based Chinese Speech Synthesis Method

Ruotong Yang, Yunfei Shen, Nian Yi, Zhixing Fan, Jiajun Liu, Wumaier Aishan
DOI: https://doi.org/10.1109/cvidliccea56201.2022.9824329
2022-01-01
Abstract: Speech synthesis is one of the key technologies for human-computer speech interaction and is widely used in fields such as audiobooks and information broadcasting. In this paper, we propose a speech synthesis model based on lightweight convolution (Lightweight Convolution-Tacotron2, LConv-T) to address the severe loss of long-distance information and the slow inference of the Tacotron2 speech synthesis model, which relies on recurrent neural networks. The encoder uses multiple densely connected lightweight convolution modules to capture long-range contextual information from the input text. To address the shortcomings of the LConv-T model, this paper further proposes a Tacotron2-based feature fusion speech synthesis model (Dynamic Lightweight Convolution-Tacotron2, DLConv-T), which improves the stability of speech synthesis by extracting text features with both a Bi-LSTM module and a dynamic lightweight convolution module, and thereby improves synthesis quality. Experimental results show that, compared with the Tacotron2 model, the LConv-T and DLConv-T models reduce the objective MCD score by 0.15 dB and 0.42 dB and improve the subjective MOS by 0.15 and 0.47, respectively.
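The abstract does not spell out the lightweight convolution operation itself. As a rough illustration only, the sketch below assumes the commonly used formulation: a depthwise 1-D convolution over time whose kernel weights are softmax-normalized and shared across groups of channels ("heads"). The function names, tensor shapes, and zero padding are assumptions made for this example and are not taken from the paper; the dense connections between modules and the dynamic variant are not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lightweight_conv(x, raw_weights):
    """Hypothetical lightweight convolution over a sequence.

    x:           (T, C) array, T time steps and C channels.
    raw_weights: (H, K) array, one kernel of odd width K per head;
                 C must be divisible by H (an assumption of this sketch).
    Returns an array of shape (T, C).
    """
    T, C = x.shape
    H, K = raw_weights.shape
    assert C % H == 0 and K % 2 == 1

    # Normalize each head's kernel so its weights sum to 1.
    w = softmax(raw_weights, axis=-1)               # (H, K)
    # Share each head's kernel across C // H consecutive channels.
    w_per_channel = np.repeat(w, C // H, axis=0)    # (C, K)

    # Zero-pad the time axis so the output keeps length T.
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))

    # Depthwise convolution: each channel is filtered independently.
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        out += xp[k:k + T] * w_per_channel[:, k]
    return out


if __name__ == "__main__":
    x = np.random.randn(10, 8)      # 10 time steps, 8 channels
    kernels = np.random.randn(2, 3) # 2 heads, kernel width 3
    print(lightweight_conv(x, kernels).shape)
```

Because the kernels are shared across channel groups and involve no recurrence, every time step can be computed in parallel, which is consistent with the inference-speed motivation stated in the abstract.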