Abstract: End-to-end text-to-speech synthesis systems have achieved immense success in recent times, with improved naturalness and intelligibility. However, end-to-end models, which primarily depend on attention-based alignment, offer no explicit provision to modify or incorporate the desired prosody while synthesizing speech. Moreover, state-of-the-art end-to-end systems use autoregressive models for synthesis, which makes prediction sequential; hence, inference time and computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also has a provision to modify or incorporate the desired prosody at a finer level by controlling the fundamental frequency (F0) and the phone duration. Generating speech utterances with appropriate prosody and rhythm helps improve the naturalness of the synthesized speech. We explicitly model the phoneme duration and the F0 to have finer control over them during synthesis. The model is trained in an end-to-end fashion to directly generate the speech waveform from the input text, which in turn depends on the auxiliary subtasks of predicting the phoneme duration, F0, and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08 and a very low inference time, using just 4 hours of training data.
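As a rough illustration of what control over phone duration and fundamental frequency (F0) can mean in practice, the sketch below scales per-phoneme durations and shifts an F0 contour before the values would be passed on to later synthesis stages. It is not the Prosody-TTS implementation: the function names, the frame-count representation of duration, the use of 0.0 to mark unvoiced frames, and the semitone-based shift are all assumptions made for this example.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[int], rate: float) -> List[int]:
    """Scale per-phoneme frame counts; rate > 1 slows speech down (assumed convention)."""
    # Keep at least one frame per phoneme so that no phoneme disappears.
    return [max(1, round(d * rate)) for d in durations]


def expand_by_duration(values: Sequence[float], durations: Sequence[int]) -> List[float]:
    """Repeat each phoneme-level value once per frame of that phoneme."""
    frames: List[float] = []
    for value, n_frames in zip(values, durations):
        frames.extend([value] * n_frames)
    return frames


def shift_f0(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Shift voiced frames by a number of semitones; 0.0 marks unvoiced frames (assumed)."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


# Hypothetical predicted values for three phonemes.
phoneme_f0 = [120.0, 0.0, 140.0]   # Hz per phoneme, 0.0 = unvoiced
durations = [3, 2, 4]              # frames per phoneme

slower = scale_durations(durations, 2.0)           # double every duration
frame_f0 = expand_by_duration(phoneme_f0, slower)  # frame-level F0 contour
raised = shift_f0(frame_f0, 12.0)                  # one octave higher
```

Because duration and F0 are explicit intermediate quantities here, each can be edited independently, which is the kind of fine-grained control that attention-based alignment alone does not expose.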

Chinese Prosody Generation Based on C-ToBI Representation for Text-To-Speech

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

DOP-Tacotron: a Fast Chinese TTS System with Local-based Attention

Into-TTS: Intonation Template Based Prosody Control System

Objective Evaluation Methods for Chinese Text-To-Speech Systems

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

A Chinese Text-to-Speech System

Prosody Analysis And Modeling For Emotional Speech Synthesis

High quality Chinese text-to-speech system - BEYOND

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control

A Tree-Based Model of Prosodic Phrasing for Chinese Text-to-Speech Systems

Prosody Model for Mandarin Text-to-Speech System

Syntactic Representation Learning for Neural Network Based TTS with Syntactic Parse Tree Traversal

Assigning Break Indices for Unrestricted Texts in Mandarin Text to Speech System