Abstract: End-to-end text-to-speech synthesis systems have achieved immense success in recent times, with improved naturalness and intelligibility. However, end-to-end models, which primarily depend on attention-based alignment, offer no explicit provision to modify or incorporate the desired prosody while synthesizing speech. Moreover, state-of-the-art end-to-end systems use autoregressive models for synthesis, which makes prediction sequential; as a result, both inference time and computational complexity are high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also allows the desired prosody to be modified or incorporated at a fine level by controlling the fundamental frequency (F0) and the phone duration. Generating speech utterances with appropriate prosody and rhythm improves the naturalness of the synthesized speech. We explicitly model the phoneme duration and the F0 to gain fine-grained control over them during synthesis. The model is trained in an end-to-end fashion to generate the speech waveform directly from the input text, relying on the auxiliary subtasks of predicting the phoneme duration, F0, and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance, with a mean opinion score of 4.08 and a very low inference time, using just 4 hours of training data.
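To illustrate what explicit control over phoneme duration and F0 can look like at synthesis time, the following is a minimal sketch and not the Prosody-TTS implementation. The function names (`scale_durations`, `length_regulate`, `shift_f0`), the frame-count representation of duration, and the convention that unvoiced frames carry an F0 of 0 are all assumptions made for this example.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[int], rate: float) -> List[int]:
    """Scale per-phoneme frame counts (hypothetical helper).

    rate > 1 lengthens each phoneme (slower speech); rate < 1 shortens it.
    """
    # Keep at least one frame per phoneme so that no phoneme disappears.
    return [max(1, round(d * rate)) for d in durations]


def length_regulate(
    phoneme_feats: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Expand phoneme-level features to frame level by repeating each
    feature vector for its (predicted or edited) number of frames."""
    frames: List[List[float]] = []
    for feat, d in zip(phoneme_feats, durations):
        frames.extend(list(feat) for _ in range(d))
    return frames


def shift_f0(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Shift a frame-level F0 contour by a number of semitones.

    Assumption: unvoiced frames are marked with 0 and are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


# Usage: slow speech down by 2x and raise the pitch by one octave.
durations = scale_durations([2, 3, 1], rate=2.0)
frames = length_regulate([[0.1], [0.2], [0.3]], durations)
f0 = shift_f0([100.0, 0.0, 120.0], semitones=12)
```

Because duration and F0 are explicit intermediate quantities in a model of this kind, edits such as these can be applied between the predictors and the waveform generator without retraining; how Prosody-TTS wires these stages together is not specified in the abstract.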

Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
