Abstract: Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by the self-attention mechanism, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of our new network. In terms of efficiency, our Transformer TTS network trains about 4.25 times faster than Tacotron2. In terms of performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
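
The two properties claimed for self-attention (hidden states built in parallel, and a direct connection between any two time steps) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single head, identity query/key/value projections (no learned weights), and made-up toy inputs, and the names `softmax`, `self_attention`, and `frames` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention (illustrative only).

    Assumption: queries, keys, and values are the inputs themselves;
    a real Transformer applies learned projections and several heads.

    x: list of T vectors, each a list of d floats.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    # Each position depends only on the inputs, never on another
    # position's output, so these iterations could all run in parallel.
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        w = softmax(scores)
        out = [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy sequence: 4 time steps with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(frames)
```

Every row of `weights` is a distribution over all time steps, so the first and last steps interact in a single operation, whereas an RNN has to carry that information through every intermediate state.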

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

Prosody Model for Mandarin Text-to-Speech System

The Study of the Trainable Prosodic Model for Chinese Text to Speech System

Learning Prosodic Patterns for Mandarin Speech Synthesis

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Pitch Prediction for Mandarin TTS with Mutual Prosodic Constraint

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

The Statistical Model of Chinese Word Contours Based on Fuzzy Clustering Method

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

A New Chinese Text-to-speech System with High Naturalness

Mandarin dialog prosody model

Pitch Models of Mandarin Text-to-speech

Clustering and Feature Learning Based F0 Prediction for Chinese Speech Synthesis

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Modeling prosody pattern of Chinese expressive speech and its application in personalized speech conversion

Data mining for learning mandarin prosodic models

Hierarchical Prosody Modeling and Control in Non-Autoregressive Parallel Neural TTS

Prosody Analysis And Modeling For Emotional Speech Synthesis

Neural Speech Synthesis with Transformer Network.