Abstract: In this paper, an automatic and unsupervised method using context-dependent hidden Markov models (CD-HMMs) is proposed for the prosodic labeling of speech synthesis databases. The method consists of three main steps: initialization, model training, and prosodic labeling. The initial prosodic labels are obtained by unsupervised clustering of acoustic features designed according to the characteristics of the prosodic descriptor to be labeled. Then, using the initial prosodic labels, CD-HMMs of the spectral parameters, F0 and phone durations are estimated in a manner similar to HMM-based parametric speech synthesis. The labels are then updated by Viterbi decoding under the maximum likelihood criterion, given the acoustic feature sequences and the trained CD-HMMs. The model training and prosodic labeling procedures are repeated iteratively until convergence. The performance of the proposed method is evaluated on Mandarin speech synthesis databases, and two prosodic descriptors are investigated: the prosodic phrase boundary and the emphasis expression. In our implementation, the prosodic phrase boundary labels are initialized by clustering the durations of the pauses between every two consecutive prosodic words, and the emphasis expression labels are initialized by examining the differences between the original and the synthetic F0 trajectories. Experimental results show that the proposed method labels the prosodic phrase boundary positions much more accurately than the text-analysis-based method, without requiring any manually labeled training data. The unit selection speech synthesis system constructed using the prosodic phrase boundary labels generated by the proposed method achieves performance similar to that of the system using manual labels. Furthermore, the unit selection speech synthesis system constructed using the emphasis expression labels generated by the proposed method conveys the emphasis information effectively while maintaining the naturalness of the synthetic speech.
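The initialize, train, relabel loop described in the abstract can be illustrated with a deliberately simplified sketch. It is not the method of the paper: a single Gaussian per label stands in for the CD-HMMs, a per-juncture maximum-likelihood choice stands in for Viterbi decoding, and the only feature is a scalar pause duration. The function names, the variance floor and the example durations are all hypothetical.

```python
import math

def two_means_1d(values, iters=50):
    """Initialization: cluster scalar pause durations into two groups.
    Returns 0/1 labels, where 1 marks the long-pause cluster."""
    lo, hi = min(values), max(values)          # initial centroids
    labels = [0] * len(values)
    for _ in range(iters):
        new_labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        short = [v for v, l in zip(values, new_labels) if l == 0]
        long_ = [v for v, l in zip(values, new_labels) if l == 1]
        if short:
            lo = sum(short) / len(short)
        if long_:
            hi = sum(long_) / len(long_)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def fit_gaussian(xs, floor=1e-4):
    """Model training stand-in: ML mean and variance (variance is floored)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, floor)

def log_likelihood(x, mean, var):
    """Log density of x under a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def relabel_until_convergence(features, labels, max_iters=20):
    """Alternate model training and maximum-likelihood relabeling
    until the labels stop changing (or max_iters is reached)."""
    for _ in range(max_iters):
        models = {}
        for k in (0, 1):
            members = [f for f, l in zip(features, labels) if l == k]
            if members:
                models[k] = fit_gaussian(members)
        new_labels = [
            max(models, key=lambda k: log_likelihood(f, *models[k]))
            for f in features
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical pause durations (seconds) between consecutive prosodic words.
pauses = [0.02, 0.03, 0.01, 0.35, 0.40, 0.05, 0.30, 0.02]
initial = two_means_1d(pauses)
final = relabel_until_convergence(pauses, initial)
print(final)  # 1 = hypothesized prosodic phrase boundary
```

In the actual method the retraining step would see spectral, F0 and duration features modeled by context-dependent HMMs, so relabeling can move labels away from the duration-only initialization; in this toy setting the same scalar drives both steps, so the labels mainly confirm the initial clustering.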

Clustering and Feature Learning Based F0 Prediction for Chinese Speech Synthesis

Auditive Learning Based Chinese F0 Prediction

F0 Prediction Model of Speech Synthesis Based on Template and Statistical Method

Learning Prosodic Patterns for Mandarin Speech Synthesis

An Optimized Neural Network Based Prosody Model of Chinese Speech Synthesis System

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

The Statistical Model of Chinese Word Contours Based on Fuzzy Clustering Method

Study of Prosody Model on Chinese Speech Synthesis Based on the Classification of Syllabic Prosody Features

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Automatic Stress Prediction of Chinese Speech Synthesis

Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis

Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Modeling prosody pattern of Chinese expressive speech and its application in personalized speech conversion

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

A Novel Hybrid Approach for Mandarin Speech Synthesis

Unsupervised Prosodic Labeling of Speech Synthesis Databases Using Context-Dependent HMMs

Data mining for learning mandarin prosodic models

Prosody Analysis And Modeling For Emotional Speech Synthesis