Abstract: Recent advances in neural TTS have made "human parity" synthesized speech possible when a large amount of studio-quality training data from a voice talent is available. However, with only limited, casual recordings from an ordinary speaker, human-like TTS remains a major challenge, and the synthesized speech can also suffer from artifacts such as incomplete sentences and repeated words. Unlike the text of Roman-alphabet languages such as English, Chinese text has no spaces between adjacent words, so word segmentation errors can cause serious semantic confusion and unnatural prosody. In this study, using a multi-speaker TTS model to compensate for the insufficient training data of a target speaker, we investigate linguistic features and Bert-derived information to improve the prosody of our Mandarin Chinese TTS. Three factors are studied: phone-related and prosody-related linguistic features; better-predicted breaks from a refined Bert-CRF model; and a phoneme sequence augmented with character embeddings derived from a Bert model. Subjective tests on in-domain and out-of-domain tasks covering News, Chat, and Audiobook show that all three factors are effective in improving the prosody of our Mandarin TTS. The model with additional character embeddings from Bert performs best, outperforming the baseline by 0.17 MOS.

Prosody Model for Mandarin Text-to-Speech System

Pitch Models of Mandarin Text-to-speech

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

The Study of the Trainable Prosodic Model for Chinese Text to Speech System

Learning Prosodic Patterns for Mandarin Speech Synthesis

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Mandarin dialog prosody model

Data mining for learning Mandarin prosodic models

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Prosody Analysis And Modeling For Emotional Speech Synthesis

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Prosody Modification on Mixed-language Speech Synthesis

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

Modeling Incompletion Phenomenon in Mandarin Dialog Prosody

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Pitch Prediction for Mandarin TTS with Mutual Prosodic Constraint