Abstract: In this paper, an automatic and unsupervised method using context-dependent hidden Markov models (CD-HMMs) is proposed for the prosodic labeling of speech synthesis databases. The method consists of three main steps: initialization, model training, and prosodic labeling. The initial prosodic labels are obtained by unsupervised clustering of acoustic features designed according to the characteristics of the prosodic descriptor to be labeled. Then, using the initial prosodic labels, CD-HMMs of the spectral parameters, F0, and phone durations are estimated in a manner similar to HMM-based parametric speech synthesis. The labels are then updated by Viterbi decoding under the maximum likelihood criterion, given the acoustic feature sequences and the trained CD-HMMs. The model training and prosodic labeling procedures are repeated until convergence. The performance of the proposed method is evaluated on Mandarin speech synthesis databases, and two prosodic descriptors are investigated: the prosodic phrase boundary and the emphasis expression. In our implementation, the prosodic phrase boundary labels are initialized by clustering the durations of the pauses between every two consecutive prosodic words, and the emphasis expression labels are initialized by examining the differences between the original and the synthetic F0 trajectories. Experimental results show that the proposed method labels the prosodic phrase boundary positions much more accurately than the text-analysis-based method, without requiring any manually labeled training data. The unit selection speech synthesis system constructed using the prosodic phrase boundary labels generated by the proposed method achieves performance similar to that of the system built with manual labels. Furthermore, the unit selection speech synthesis system constructed using the emphasis expression labels generated by the proposed method can convey emphasis information effectively while maintaining the naturalness of the synthetic speech.
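The initialize, train, and relabel loop described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: a single 1-D Gaussian per label stands in for the CD-HMMs, an independent maximum-likelihood decision per boundary stands in for Viterbi decoding, and the same pause-duration feature is used for both initialization and modeling. All function names and the toy pause durations are hypothetical.

```python
import math


def two_means_1d(values, iters=20):
    """Initialization: cluster 1-D values into two groups (1 = larger-mean cluster)."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels


def fit_gaussian(samples, var_floor=1e-4):
    """Training step stand-in: ML estimate of a 1-D Gaussian (assumed model, not a CD-HMM)."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, var_floor)


def log_likelihood(x, model):
    """Log density of x under a (mean, variance) Gaussian."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def iterative_labeling(features, init_labels, max_iters=50):
    """Alternate model training and ML relabeling until the labels stop changing."""
    labels = list(init_labels)
    for _ in range(max_iters):
        # Model training: one Gaussian per label currently in use.
        models = {
            lab: fit_gaussian([f for f, l in zip(features, labels) if l == lab])
            for lab in set(labels)
        }
        # Labeling: pick the label whose model gives the highest likelihood.
        new_labels = [
            max(models, key=lambda lab: log_likelihood(f, models[lab]))
            for f in features
        ]
        if new_labels == labels:  # convergence: no label changed
            break
        labels = new_labels
    return labels


# Hypothetical pause durations (seconds) between consecutive prosodic words.
pause_durations = [0.02, 0.03, 0.05, 0.04, 0.30, 0.35, 0.28, 0.06, 0.40]
init_labels = two_means_1d(pause_durations)
final_labels = iterative_labeling(pause_durations, init_labels)
print(final_labels)
```

In this toy setting the long pauses are labeled 1 (boundary) and the short ones 0. In the method described above, the modeling step would instead use sequence models over spectral, F0, and duration features, so the relabeling can draw on evidence beyond the feature used for initialization.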

Automatic Prosodic Boundary Labeling Based on Fusing the Silence Duration with the Lexical Features

Acoustic and Linguistic Information Based Chinese Prosodic Boundary Labelling

Automatic Prosodic Structure Labeling using DNN-BGRU-CRF Hybrid Neural Network

Automatic Phrase Boundary Labeling for Mandarin TTS Corpus Using Context-Dependent HMM

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

Improving Prosodic Boundaries Prediction For Mandarin Speech Synthesis By Using Enhanced Embedding Feature And Model Fusion Approach

Automatic segmentation of Chinese Mandarin speech into syllable-like

A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Unsupervised Prosodic Phrase Boundary Labeling Of Mandarin Speech Synthesis Database Using Context-Dependent HMM

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Automatic Phrase Boundary Labeling for a Mandarin TTS Corpus Using the Viterbi Decoding Algorithm

Unsupervised Prosodic Labeling Of Speech Synthesis Databases Using Context-Dependent HMMs

Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features

Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings

Investigating Effect of Rich Syntactic Features on Mandarin Prosodic Boundaries Prediction

Automatic Phrase Boundary Labeling of Speech Synthesis Database Using Context-Dependent HMMs and N-Gram Prior Distributions

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

Chinese Prosodic Phrasing with Extended Features

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP