Abstract: Recent research has shown that end-to-end architectures achieve superior performance in text-to-speech (TTS) synthesis. However, given the complex linguistic structure of Chinese, using Chinese characters directly as input for Mandarin TTS may suffer from poor linguistic encoding, resulting in improper word tokenization and pronunciation errors. To ensure the naturalness and intelligibility of synthetic speech, state-of-the-art Mandarin TTS systems employ a pipeline of components, such as word tokenization, part-of-speech (POS) tagging and grapheme-to-phoneme (G2P) conversion, to produce knowledge-enhanced inputs that alleviate the problems caused by linguistic encoding. These components are well designed and grounded in linguistic expertise, but they are trained individually, so their errors compound in the TTS system. In this paper, to reduce the complexity of the Mandarin TTS system and bring further improvement, we propose a knowledge-based linguistic encoder for the character-based end-to-end Mandarin TTS system. Built on a multi-task learning structure, the proposed encoder learns from linguistic analysis subtasks and provides robust and discriminative linguistic encodings for the subsequent speech generation decoder. Experimental results demonstrate the effectiveness of the proposed framework: compared with a state-of-the-art baseline approach, word tokenization error drops from 12.81% to 1.58% and syllable pronunciation error drops from 10.89% to 2.81%, while the mean opinion score (MOS) improves from 3.76 to 3.87.
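
To put the reported figures in perspective, the absolute numbers in the abstract can be converted into relative error reductions. The sketch below is only illustrative arithmetic on the values quoted above; the helper name `relative_reduction` is hypothetical and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fraction of the baseline error that was removed."""
    if before <= 0:
        raise ValueError("baseline error must be positive")
    return (before - after) / before


# Figures quoted in the abstract (error rates in percent, MOS on its own scale).
word_tokenization = relative_reduction(12.81, 1.58)
syllable_pronunciation = relative_reduction(10.89, 2.81)
mos_gain = 3.87 - 3.76

print(f"word tokenization error: {word_tokenization:.1%} relative reduction")
print(f"syllable pronunciation error: {syllable_pronunciation:.1%} relative reduction")
print(f"MOS gain: {mos_gain:.2f}")
```

Under this reading, the word tokenization error falls by roughly 87.7% relative to the baseline and the syllable pronunciation error by roughly 74.2%, while the MOS gain is 0.11 points.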

Improving Prosodic Boundaries Prediction For Mandarin Speech Synthesis By Using Enhanced Embedding Feature And Model Fusion Approach

Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings

BLSTM-CRF Based End-To-End Prosodic Boundary Prediction With Context Sensitive Embeddings In A Text-To-Speech Front-End

Parsing Hierarchical Prosodic Structure For Mandarin Speech Synthesis

Prosodic boundary prediction based on maximum entropy model with error-driven modification

Prosodic Word Boundaries Prediction for Mandarin Text-to-Speech

Investigating Effect of Rich Syntactic Features on Mandarin Prosodic Boundaries Prediction

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

A Character-level Span-based Model for Mandarin Prosodic Structure Prediction

A Novel Hybrid Approach for Mandarin Speech Synthesis

A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin

Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features

Pitch Prediction for Mandarin TTS with Mutual Prosodic Constraint

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis

Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer