Abstract: Recent neural speech synthesis systems have increasingly focused on prosody control to improve the quality of synthesized speech, but they rarely consider the variability of prosody together with the correlation between prosody and semantics. In this paper, a prosody learning mechanism is proposed to model the prosody of speech in a TTS system: a prosody learner extracts prosody information from the mel-spectrum, and this information is combined with the phoneme sequence to reconstruct the mel-spectrum. Meanwhile, semantic features of the text from a pre-trained language model are introduced to improve the prosody prediction results. In addition, a novel self-attention structure, named local attention, is proposed to lift the restriction on input text length; it models the relative position information of the sequence with relative position matrices, so position encodings are no longer needed. Experiments on English and Mandarin show that our model obtains speech with more satisfactory prosody. In Mandarin synthesis in particular, the proposed model outperforms the baseline model with a MOS gap of 0.08, and the overall naturalness of the synthesized speech is significantly improved.
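
The idea of a self-attention layer that relies on relative rather than absolute positions can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the exact form of local attention, so the function name `local_attention`, the `window` parameter, and the use of one learned scalar bias per relative offset (`rel_bias`, a simplification of relative position matrices) are all assumptions made for illustration. The sketch only shows why such a layer needs no position encodings and is not tied to a fixed input length.

```python
import numpy as np

def local_attention(x, w_q, w_k, w_v, rel_bias, window):
    """Windowed self-attention with a relative-position bias (illustrative only).

    x        : (T, d) input sequence, T may be any length
    w_q/k/v  : (d, d) projection matrices (hypothetical parameters)
    rel_bias : (2 * window + 1,) one scalar per relative offset in [-window, window]
    window   : how far each position may attend to the left and right
    """
    T = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])

    # offset[i, j] = j - i, the position of key j relative to query i
    idx = np.arange(T)
    offset = idx[None, :] - idx[:, None]
    inside = np.abs(offset) <= window

    # Look up the bias by relative offset only; no absolute position is used.
    bias = rel_bias[np.clip(offset, -window, window) + window]
    scores = np.where(inside, scores + bias, -np.inf)

    # Numerically stable softmax over the keys inside each window.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias depends only on `j - i`, the output at a position is determined by the tokens inside its window, regardless of where that window sits in the sequence or how long the sequence is.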

Prosodic Correlation Model in Text-to-Speech Synthesis

Research on Predicting Prosodic Parameters for Chinese Synthesis by Data Mining Approach

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Prosody Model for Mandarin Text-to-Speech System

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

A New Prosodic Strength Calculation Method for Prosody Reduction Modeling

The Study of the Trainable Prosodic Model for Chinese Text to Speech System

Learning Prosodic Patterns for Mandarin Speech Synthesis

Prosody Modelling with Pre-trained Cross-utterance Representations for Improved Speech Synthesis

Pitch Models of Mandarin Text-to-speech

Study of Prosody Model on Chinese Speech Synthesis Based on the Classification of Syllabic Prosody Features

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

A novel unit selection method for concatenation speech system using similarity measure

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Hierarchical Non-Uniform Unit Selection Based on Prosodic Structure

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System