Abstract:End-to-end text-to-speech synthesis systems achieved immense success in recent times, with improved naturalness and intelligibility. However, the end-to-end models, which primarily depend on the attention-based alignment, do not offer an explicit provision to modify/incorporate the desired prosody while synthesizing the speech. Moreover, the state-of-the-art end-to-end systems use autoregressive models for synthesis, making the prediction sequential. Hence, the inference time and the computational complexity are quite high. This paper proposes Prosody-TTS, a data-efficient end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also has a provision to modify or incorporate the desired prosody at the finer level by controlling the fundamental frequency ( ) and the phone duration. Generating speech utterances with appropriate prosody and rhythm helps in improving the naturalness of the synthesized speech. We explicitly model the duration of the phoneme and the to have a finer level control over them during the synthesis. The model is trained in an end-to-end fashion to directly generate the speech waveform from the input text, which in turn depends on the auxiliary subtasks of predicting the phoneme duration, , and Mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08, with a very low inference time using just 4 hours of training data.

Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Prosody Analysis of Audiobooks

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Context-Aware Prosody Correction for Text-Based Speech Editing

Investigating context features hidden in End-to-End TTS

On Prosody Modeling for ASR+TTS based Voice Conversion

Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control

Applying Syntax$\unicode{x2013}$Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

Improved Prosody Generation by Maximizing Joint Probability of State and Longer Units

Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

IMPROVING NATURALNESS AND CONTROLLABILITY OF SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS BY LEARNING LOCAL PROSODY REPRESENTATIONS