Abstract: The syntactic structure of a sentence is correlated with the prosodic structure of its speech, which is crucial for improving the prosody and naturalness of a text-to-speech (TTS) system. Current TTS systems usually incorporate syntactic structure information through manually designed features based on expert knowledge. In this paper, we propose a syntactic representation learning method based on syntactic parse tree traversal to automatically utilize the syntactic structure information. Two constituent label sequences are linearized from the constituent parse tree through left-first and right-first traversals. Syntactic representations are then extracted at the word level from each constituent label sequence by a corresponding uni-directional gated recurrent unit (GRU) network. Meanwhile, a nuclear-norm maximization loss is introduced to enhance the discriminability and diversity of the constituent label embeddings. Upsampled syntactic representations and phoneme embeddings are concatenated to serve as the encoder input of Tacotron2. Experimental results demonstrate the effectiveness of the proposed approach, with the mean opinion score (MOS) increasing from 3.70 to 3.82 and the ABX preference exceeding that of the baseline by 17%. In addition, for sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.
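The two traversals described in the abstract can be illustrated with a minimal sketch. It assumes a pre-order depth-first traversal that visits children either left to right or right to left; the abstract does not specify the exact linearization, so the paper's scheme may differ. The `Node` class, the `linearize` function, and the toy tree are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A constituent parse tree node (hypothetical helper, not from the paper)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node, left_first: bool = True) -> List[str]:
    """Return constituent labels in pre-order.

    Assumption: children are visited left to right when left_first is True
    and right to left otherwise.
    """
    labels = [node.label]
    kids = node.children if left_first else list(reversed(node.children))
    for child in kids:
        labels.extend(linearize(child, left_first))
    return labels


# Toy tree for "the cat sleeps": (S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))
tree = Node("S", [
    Node("NP", [Node("DT"), Node("NN")]),
    Node("VP", [Node("VBZ")]),
])

left_seq = linearize(tree, left_first=True)
right_seq = linearize(tree, left_first=False)
print(left_seq)   # ['S', 'NP', 'DT', 'NN', 'VP', 'VBZ']
print(right_seq)  # ['S', 'VP', 'VBZ', 'NP', 'NN', 'DT']
```

In the described method, each of the two label sequences would then be embedded and passed through its own uni-directional GRU to obtain word-level syntactic representations. That step is omitted here because the abstract does not give its details.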

Syntactic Representation Learning for Neural Network Based TTS with Syntactic Parse Tree Traversal