Abstract: HMM-based automatic segmentation is widely used in corpus construction for concatenative speech synthesis. Because the main sources of inaccuracy in HMM-based automatic segmentation are the HMM training criterion and duration control, we study these two issues. For HMM training, we apply discriminative training and introduce a new criterion, Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and the parameters are optimized by the Generalized Probabilistic Descent (GPD) algorithm. For duration control, we apply explicit duration models and, to limit the computational cost, propose a two-step segmentation method in which the duration model is incorporated in a postprocessing step. Experimental results show that the two techniques significantly improve segmentation accuracy with different focuses: MSGE-based discriminative training improves the accuracy of sensitive boundaries, i.e., boundaries where a segmentation error is likely to cause a noticeable degradation in speech synthesis quality, whereas explicit duration modeling eliminates large errors. Combining the two techniques reduced the average error from 6.86 ms to 5.79 ms on Japanese data and from 8.67 ms to 6.61 ms on Chinese data. At the same time, the number of errors larger than 30 ms was reduced by 25% and 51% on Chinese and Japanese data, respectively.
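The two figures of merit quoted in the abstract (average boundary error in ms, and the number of errors larger than 30 ms) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the use of absolute time differences as the error, and the toy boundary times are all assumptions made for illustration.

```python
from statistics import mean


def boundary_errors_ms(reference_ms, automatic_ms):
    """Absolute deviation (ms) of each automatic boundary from its reference.

    Assumption: both lists hold boundary times in ms, aligned one-to-one.
    """
    if len(reference_ms) != len(automatic_ms):
        raise ValueError("boundary lists must have the same length")
    return [abs(a - r) for r, a in zip(reference_ms, automatic_ms)]


def summarize(errors_ms, large_threshold_ms=30.0):
    """Average error and count of 'large' errors (above the threshold)."""
    return {
        "mean_error_ms": mean(errors_ms),
        "n_large_errors": sum(e > large_threshold_ms for e in errors_ms),
    }


def relative_reduction(before, after):
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before


# Toy data (hypothetical, not taken from the paper).
reference = [100.0, 250.0, 400.0, 610.0]
automatic = [104.0, 245.0, 435.0, 612.0]
errors = boundary_errors_ms(reference, automatic)
print(summarize(errors))
```

Applied to the averages reported above, `relative_reduction(6.86, 5.79)` is roughly 0.156 and `relative_reduction(8.67, 6.61)` is roughly 0.238, i.e. relative reductions of about 15.6% (Japanese) and 23.8% (Chinese) in average error.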

Minimum Generation Error Training for HMM-Based Speech Synthesis

Full HMM Training for Minimizing Generation Error in Synthesis

Model Adaptation for HMM-Based Speech Synthesis under Minimum Generation Error Criterion

Minimum Generation Error Training for HMM-based Prediction of Articulatory Movements

Minimum Generation Error Training With Direct Log Spectral Distortion On LSPs For HMM-Based Speech Synthesis

Minimum generation error training with weighted Euclidean distance on LSP for HMM-based speech synthesis

Minimum Generation Error Training by Using Original Spectrum As Reference for Log Spectral Distortion Measure

Minimum Unit Selection Error Training for HMM-based Unit Selection Speech Synthesis System

HMM training method based on evolutionary computation and MDI in speech recognition

Cross-Validation and Minimum Generation Error Based Decision Tree Pruning for HMM-based Speech Synthesis

Minimum Kullback–Leibler Divergence Parameter Generation for HMM-Based Speech Synthesis

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training

Cross Validation and Minimum Generation Error for Improved Model Clustering in HMM-based TTS

Preserve ordering property of generated LSPs for minimum generation error training in HMM-based speech synthesis

Training HMMs using a minimum error criterion with different loss measures

Global Variance Modeling on the Log Power Spectrum of LSPs for HMM-based Speech Synthesis

Discriminative Training and Explicit Duration Modeling for HMM-based Automatic Segmentation

Acoustic statistical modeling based new generation speech synthesis technology

Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep Learning-Based Speech Enhancement

Automatic Error Detection For Unit Selection Speech Synthesis Using Log Likelihood Ratio Based SVM Classifier

Soft GPD for Minimum Classification Error Rate Training