Abstract: HMM-based automatic segmentation is widely used to construct corpora for concatenative speech synthesis. Because the main sources of inaccuracy in HMM-based automatic segmentation are the HMM training criterion and duration control, we study these two issues. For HMM training, we apply discriminative training and introduce a new criterion, named Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and the parameters are optimized with the Generalized Probabilistic Descent (GPD) algorithm. For duration control, we apply explicit duration models and, to limit the computational cost, propose a two-step segmentation method in which the duration model is incorporated as a postprocessing step. In our experiments, both techniques significantly improve segmentation accuracy, but with different focuses: MSGE-based discriminative training mainly improves the accuracy of sensitive boundaries, i.e., boundaries where a segmentation error is likely to cause a noticeable degradation in synthesized speech quality, while explicit duration modeling mainly eliminates large errors. With the two techniques combined, the average error was reduced from 6.86 ms to 5.79 ms on Japanese data, and from 8.67 ms to 6.61 ms on Chinese data. At the same time, the number of errors larger than 30 ms was reduced by 25% and 51% on Chinese and Japanese data, respectively.
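
The two figures of merit quoted in the abstract, the average boundary error and the number of errors larger than 30 ms, can be illustrated with a minimal Python sketch. It assumes that each boundary error is the absolute difference, in milliseconds, between a reference (for example, hand-labeled) boundary and the automatically placed one; the abstract does not spell out this definition. The function names and the sample boundary times are hypothetical and serve only as illustration.

```python
from typing import List, Sequence


def boundary_errors_ms(reference: Sequence[float],
                       automatic: Sequence[float]) -> List[float]:
    """Absolute deviation of each automatic boundary from its reference, in ms.

    Assumption: both sequences list the same boundaries in the same order.
    """
    if len(reference) != len(automatic):
        raise ValueError("reference and automatic must have the same length")
    return [abs(a - r) for r, a in zip(reference, automatic)]


def mean_error_ms(errors: Sequence[float]) -> float:
    """Average boundary error in ms."""
    return sum(errors) / len(errors)


def count_large_errors(errors: Sequence[float],
                       threshold_ms: float = 30.0) -> int:
    """Number of boundaries whose error exceeds the threshold (30 ms here)."""
    return sum(1 for e in errors if e > threshold_ms)


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before


# Hypothetical boundary times in ms, for illustration only.
reference = [100.0, 250.0, 400.0, 610.0]
automatic = [104.0, 245.0, 435.0, 612.0]

errors = boundary_errors_ms(reference, automatic)
print(mean_error_ms(errors))       # average error over the four boundaries
print(count_large_errors(errors))  # boundaries that are off by more than 30 ms

# Relative reductions implied by the averages reported in the abstract.
print(round(relative_reduction(6.86, 5.79), 3))  # Japanese data
print(round(relative_reduction(8.67, 6.61), 3))  # Chinese data
```

Under these assumptions, the reported averages correspond to relative reductions of roughly 16% on Japanese data and 24% on Chinese data.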

A study of duration in continuous speech recognition based on DDBHMM

Duration-Distribution-Based HMM for Speech Recognition

Improved algorithm with duration information for continuous speech recognition

An inhomogeneous HMM speech recognition algorithm

Continuous Speech Recognition Based on the Triphone DDBHMM

One-stage Search Algorithm for Large Vocabulary Continuous Speech Recognition Based on DDBHMM

Digital Speech Recognition Based on DDBHMM

HMM training method based on evolutionary computation and MDI in speech recognition

The speaking rate adaptation algorithm in Putonghua continuous speech recognition

Duration Model for post-processing in a Mandarin speech recognition system

Language Identification between Mandarin and English with State Duration Information

Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

Algorithm for Mandarin Continuous Speech Recognition Based on Context-Dependent Unit Between Syllables

The Inhomogeneous HMM with General Topological Structure and Its Application in Language Identification between Mandarin and English

Discriminative Training and Explicit Duration Modeling for HMM-based Automatic Segmentation

State Duration-Based Segmental Probability Model for Chinese Speech

Noisy speech recognition performance of discriminative HMMs

Duration Modeling in Mandarin Connected Digit Recognition

An unvoiced/voiced duration adjustment algorithm based on context features in mandarin TTS

Using Frame Correlation Algorithm in A Duration Distribution Based Hidden Markov Model

Duration optimization of speaker adaptation in Mandarin TTS