Abstract: This paper presents a decision tree pruning method for the model clustering of HMM-based parametric speech synthesis, using cross-validation (CV) under the minimum generation error (MGE) criterion. Decision-tree-based model clustering is an important component of the training process of an HMM-based speech synthesis system. Conventionally, the maximum likelihood (ML) criterion is employed to choose the optimal contextual question from the question set for each tree node split, and the minimum description length (MDL) principle is used as the stopping criterion to prevent the tree from growing overly large. However, the MDL criterion is derived under an asymptotic assumption and is theoretically problematic when the training data set is not large enough. In addition, the MDL criterion is inconsistent with the aim of speech synthesis. Therefore, a minimum cross generation error (MCGE) based decision tree pruning method for HMM-based speech synthesis is proposed in this paper. The initial decision tree is trained by MDL clustering with a factor estimated by cross-validation under the MCGE criterion. The decision tree size is then tuned by iteratively backing off or splitting each leaf node to minimize the cross generation error, which is defined as the sum of the generation errors calculated for all training sentences using cross-validation. Objective and subjective evaluation results show that the proposed method significantly outperforms the conventional MDL-based model clustering method.
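The first stage described above, estimating the clustering penalty factor by cross-validation, can be illustrated with a minimal toy sketch. It is not the authors' implementation, and every name in it (`make_toy_data`, `build_tree`, `cross_generation_error`, `select_factor`) is hypothetical. It assumes a one-dimensional regression tree in place of the context decision tree, a squared-error reduction in place of the ML split gain, a fixed `penalty` threshold in place of the factor-scaled MDL penalty, and the squared prediction error on held-out points in place of the generation error. The second stage (backing off or splitting individual leaf nodes) is not covered.

```python
import math


def make_toy_data(n=100):
    """Deterministic toy data: a step function plus a small sinusoidal perturbation."""
    return [(i / n, (1.0 if i >= n // 2 else 0.0) + 0.1 * math.sin(37 * i))
            for i in range(n)]


def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def build_tree(data, penalty):
    """Greedy 1-D tree: keep splitting while the best error reduction exceeds `penalty`."""
    ys = [y for _, y in data]
    total = sse(ys)
    best_gain, best_cut = 0.0, None
    xs = sorted({x for x, _ in data})
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in data if x <= cut]
        right = [y for x, y in data if x > cut]
        gain = total - sse(left) - sse(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    # Stopping rule: a stand-in for the MDL threshold (assumption of this sketch).
    if best_cut is None or best_gain <= penalty:
        return ("leaf", sum(ys) / len(ys))
    left = [(x, y) for x, y in data if x <= best_cut]
    right = [(x, y) for x, y in data if x > best_cut]
    return ("node", best_cut, build_tree(left, penalty), build_tree(right, penalty))


def predict(tree, x):
    """Walk down the tree and return the mean stored at the reached leaf."""
    while tree[0] == "node":
        _, cut, left, right = tree
        tree = left if x <= cut else right
    return tree[1]


def cross_generation_error(data, penalty, k=5):
    """Sum of held-out squared errors over k interleaved folds."""
    total = 0.0
    for fold in range(k):
        held_out = data[fold::k]
        train = [p for i, p in enumerate(data) if i % k != fold]
        tree = build_tree(train, penalty)
        total += sum((predict(tree, x) - y) ** 2 for x, y in held_out)
    return total


def select_factor(data, factors, k=5):
    """Pick the candidate penalty with the smallest cross-validated error."""
    return min(factors, key=lambda f: cross_generation_error(data, f, k))


if __name__ == "__main__":
    data = make_toy_data()
    for f in (0.0, 0.5, 1e6):
        print(f, cross_generation_error(data, f))
    print("selected:", select_factor(data, [0.0, 0.5, 1e6]))
```

In this sketch the penalty is chosen by the held-out error directly rather than by an asymptotic formula, which mirrors the motivation stated in the abstract for replacing a fixed MDL threshold when training data are limited.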

Cross-Validation and Minimum Generation Error Based Decision Tree Pruning for HMM-based Speech Synthesis
