Abstract: In this paper, an automatic and unsupervised method using context-dependent hidden Markov models (CD-HMMs) is proposed for the prosodic labeling of speech synthesis databases. This method consists of three main steps, i.e., initialization, model training and prosodic labeling. The initial prosodic labels are obtained by unsupervised clustering using acoustic features designed according to the characteristics of the prosodic descriptor to be labeled. Then, CD-HMMs of the spectral parameters, F0s and phone durations are estimated in a manner similar to HMM-based parametric speech synthesis using the initial prosodic labels. These labels are further updated by Viterbi decoding under the maximum likelihood criterion given the acoustic feature sequences and the trained CD-HMMs. The model training and prosodic labeling procedures are conducted iteratively until convergence. The performance of the proposed method is evaluated on Mandarin speech synthesis databases and two prosodic descriptors are investigated, i.e., the prosodic phrase boundary and the emphasis expression. In our implementation, the prosodic phrase boundary labels are initialized by clustering the durations of the pauses between every two consecutive prosodic words, and the emphasis expression labels are initialized by examining the differences between the original and the synthetic F0 trajectories. Experimental results show that the proposed method is able to label the prosodic phrase boundary positions much more accurately than the text-analysis-based method without requiring any manually labeled training data. The unit selection speech synthesis system constructed using the prosodic phrase boundary labels generated by our proposed method achieves similar performance to that using the manual labels. Furthermore, the unit selection speech synthesis system constructed using the emphasis expression labels generated by our proposed method can convey the emphasis information effectively while maintaining the naturalness of synthetic speech.
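The initialize, train and relabel loop described above can be illustrated with a deliberately simplified sketch. It is not the proposed system: a single Gaussian per label stands in for the CD-HMMs, a per-item maximum likelihood choice stands in for Viterbi decoding, and the only feature is the pause duration. The function names (`two_means_1d`, `iterative_relabel`) and the pause values are hypothetical and chosen for illustration only.

```python
import math

def two_means_1d(values, iters=50):
    """Cluster scalars into two groups; label 1 is the larger-mean cluster."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        new = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        g0 = [v for v, l in zip(values, new) if l == 0]
        g1 = [v for v, l in zip(values, new) if l == 1]
        if g0:
            lo = sum(g0) / len(g0)
        if g1:
            hi = sum(g1) / len(g1)
        if new == labels:
            break
        labels = new
    return labels

def fit_gaussian(xs, floor=1e-4):
    """Maximum likelihood mean and (floored) variance of a list of scalars."""
    mean = sum(xs) / len(xs)
    var = max(sum((x - mean) ** 2 for x in xs) / len(xs), floor)
    return mean, var

def log_likelihood(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def iterative_relabel(values, labels, max_iters=20):
    """Alternate model training and relabeling until the labels stop changing."""
    for _ in range(max_iters):
        models = {}
        for c in (0, 1):
            xs = [v for v, l in zip(values, labels) if l == c]
            if not xs:          # one class emptied out; keep current labels
                return labels
            models[c] = fit_gaussian(xs)   # stand-in for CD-HMM training
        # Stand-in for Viterbi decoding: pick the most likely label per item.
        new = [max((0, 1), key=lambda c: log_likelihood(v, *models[c]))
               for v in values]
        if new == labels:
            break
        labels = new
    return labels

# Hypothetical pause durations (seconds) between consecutive prosodic words.
pauses = [0.02, 0.03, 0.01, 0.04, 0.35, 0.40, 0.02, 0.50, 0.03, 0.45]
init = two_means_1d(pauses)               # 1 = candidate phrase boundary
final = iterative_relabel(pauses, init)
```

In this toy setting the long pauses are marked as boundary candidates by the clustering step and the labels are already stable after one training pass. In the method described in the abstract, the relabeling step draws on spectral, F0 and duration models rather than on the pause duration alone.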
