Abstract: Recent advances in neural TTS have made "human parity" synthesized speech possible when a large amount of studio-quality training data from a voice talent is available. However, with only limited, casual recordings from an ordinary speaker, human-like TTS remains a major challenge, and artifacts such as incomplete sentences and repeated words also occur. Chinese text, unlike that of Roman-alphabet languages such as English, has no blank space between adjacent words, so word segmentation errors can cause serious semantic confusion and unnatural prosody. In this study, using a multi-speaker TTS to compensate for the insufficient training data of a target speaker, we investigate linguistic features and BERT-derived information to improve the prosody of our Mandarin Chinese TTS. Three factors are studied: phone-related and prosody-related linguistic features; better-predicted breaks from a refined BERT-CRF model; and a phoneme sequence augmented with character embeddings derived from a BERT model. Subjective tests on in-domain and out-of-domain tasks of News, Chat and Audiobook show that all three factors are effective in improving the prosody of our Mandarin TTS. The model with additional character embeddings from BERT performs best, outperforming the baseline by a MOS gain of 0.17.
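The abstract does not say how the phoneme sequence is augmented with character embeddings, so the following is only a minimal sketch under one plausible assumption: each character's embedding is repeated for every phoneme that character produces and then concatenated to that phoneme's own vector. The function name `augment_phonemes`, the `phonemes_per_char` alignment, and the toy vectors are all hypothetical and stand in for real BERT outputs and a real phoneme encoder.

```python
from typing import List, Sequence


def augment_phonemes(
    phoneme_vecs: Sequence[Sequence[float]],
    char_vecs: Sequence[Sequence[float]],
    phonemes_per_char: Sequence[int],
) -> List[List[float]]:
    """Concatenate each phoneme vector with the vector of its source character.

    Assumption (not from the paper): character-level vectors are upsampled to
    phoneme length by simple repetition.
    """
    if len(char_vecs) != len(phonemes_per_char):
        raise ValueError("need one phoneme count per character")
    if sum(phonemes_per_char) != len(phoneme_vecs):
        raise ValueError("phoneme counts must cover the whole phoneme sequence")

    augmented: List[List[float]] = []
    idx = 0
    for char_vec, count in zip(char_vecs, phonemes_per_char):
        # Repeat the character vector once per phoneme it spans.
        for _ in range(count):
            augmented.append(list(phoneme_vecs[idx]) + list(char_vec))
            idx += 1
    return augmented


# Toy data: two characters that map to 2 and 3 phonemes respectively.
phoneme_vecs = [[0.1], [0.2], [0.3], [0.4], [0.5]]
char_vecs = [[1.0, 1.5], [2.0, 2.5]]
result = augment_phonemes(phoneme_vecs, char_vecs, [2, 3])
```

Repetition keeps the augmented sequence the same length as the phoneme sequence, so a downstream encoder could consume it without any change to its attention or duration handling. Other alignments, such as attention over the character sequence, would be equally consistent with the abstract.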

The Study of the Trainable Prosodic Model for Chinese Text to Speech System

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

Prosody Model for Mandarin Text-to-Speech System

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Learning Prosodic Patterns for Mandarin Speech Synthesis

Study of Prosody Model on Chinese Speech Synthesis Based on the Classification of Syllabic Prosody Features

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Hierarchical Prosody Modeling and Control in Non-Autoregressive Parallel Neural TTS

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Modeling Prosody Pattern of Chinese Expressive Speech and Its Application in Personalized Speech Conversion

Improving Naturalness and Controllability of Sequence-to-Sequence Speech Synthesis by Learning Local Prosody Representations

Data Mining for Learning Mandarin Prosodic Models

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Training Prosodic Phrasing Rules for Chinese TTS Systems

Pitch Models of Mandarin Text-to-speech

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Modeling the Expressivity of Input Text Semantics for Chinese Text-to-Speech Synthesis in a Spoken Dialog System