Abstract: End-to-end text-to-speech synthesis (TTS), which generates speech sounds directly from strings of text or phonemes, has improved the quality of speech synthesis over conventional TTS. However, most previous studies have been evaluated based on subjective naturalness and have not objectively examined whether they can reproduce pitch patterns of phonological phenomena such as downstep, rhythmic boost, and initial lowering that reflect syntactic structures in Japanese. These phenomena can be linguistically explained by phonological constraints and the syntax–prosody mapping hypothesis (SPMH), which assumes projections from syntactic structures to the phonological hierarchy. Although some experiments in psycholinguistics have verified the validity of the SPMH, it is crucial to investigate whether it can be implemented in TTS. To synthesize linguistic phenomena involving syntactic or phonological constraints, we propose a model using phonological symbols based on the SPMH and prosodic well-formedness constraints. Experimental results showed that the proposed method synthesized pitch patterns similar to those reported in linguistics experiments for the phenomena of initial lowering and rhythmic boost. The proposed model efficiently synthesizes phonological phenomena in the test data that were not explicitly included in the training data.

HIERARCHICAL PROSODY MODELING FOR NON-AUTOREGRESSIVE SPEECH SYNTHESIS

Hierarchical Prosody Modeling and Control in Non-Autoregressive Parallel Neural TTS

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Hierarchical Prosody Analysis and Modeling for Emotional Conversions

Study of Prosody Model on Chinese Speech Synthesis Based on the Classification of Syllabic Prosody Features

Learning Prosodic Patterns for Mandarin Speech Synthesis

Parsing Hierarchical Prosodic Structure For Mandarin Speech Synthesis

Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis

Prosody Analysis And Modeling For Emotional Speech Synthesis

Prosody Model for Mandarin Text-to-Speech System

Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis

The Study of the Trainable Prosodic Model for Chinese Text to Speech System

An Innovative Prosody Modeling Method for Chinese Speech Recognition

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Modeling prosody pattern of Chinese expressive speech and its application in personalized speech conversion