Transformer-PSS: A High-Efficiency Prosodic Speech Synthesis Model based on Transformer

Yutian Wang, Kai Cui, Hui Wang, Jingling Wang, Qin Zhang
DOI: https://doi.org/10.1109/ICIBA50161.2020.9277162
2020-01-01
Abstract: Much attention has been given to prosodic speech synthesis with the progress of human-computer interaction and automatic content generation. However, one of its disadvantages is high computational complexity, which makes both training and inference difficult. Most popular models, such as GST-Tacotron, use a recursive decoder to generate the Mel spectrum frame by frame, which may be the main bottleneck limiting their computation speed. In this paper, we propose Transformer-PSS, a high-efficiency prosodic speech synthesis model that employs the Transformer architecture to generate the Mel spectrum in a non-autoregressive way. Experiments show that our model is 4.6x faster than popular models and achieves better generative quality on objective evaluations.