Abstract: HMM-based automatic segmentation is widely used in corpus construction for concatenative speech synthesis. Because the main sources of inaccuracy in HMM-based automatic segmentation are the HMM training criterion and duration control, we study these two issues. For HMM training, we apply discriminative training and introduce a new criterion, Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and the parameters are optimized with the Generalized Probabilistic Descent (GPD) algorithm. For duration control, we apply explicit duration models and propose a two-step segmentation method that limits the computational cost by incorporating the duration model in a postprocessing step. Experimental results show that both techniques significantly improve segmentation accuracy, with different focuses: MSGE-based discriminative training improves the accuracy of sensitive boundaries, i.e., boundaries where a segmentation error is likely to cause a noticeable degradation in speech synthesis quality, while explicit duration modeling eliminates large errors. Combining the two techniques reduced the average error from 6.86 ms to 5.79 ms on Japanese data and from 8.67 ms to 6.61 ms on Chinese data. At the same time, the number of errors larger than 30 ms was reduced by 25% and 51% on Chinese and Japanese data, respectively.
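The two figures of merit quoted in the abstract, the average boundary error and the number of errors larger than 30 ms, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the use of absolute deviation as the per-boundary error, and the sample boundary times are assumptions made purely for illustration.

```python
from typing import Sequence, Tuple


def boundary_error_stats(
    auto_ms: Sequence[float],
    ref_ms: Sequence[float],
    large_threshold_ms: float = 30.0,
) -> Tuple[float, int]:
    """Return (mean absolute boundary error in ms, count of errors above the threshold).

    Assumption: the error of one boundary is the absolute difference between
    the automatically placed boundary and the reference boundary.
    """
    if len(auto_ms) != len(ref_ms):
        raise ValueError("boundary lists must have the same length")
    if not auto_ms:
        raise ValueError("at least one boundary is required")
    errors = [abs(a - r) for a, r in zip(auto_ms, ref_ms)]
    mean_error = sum(errors) / len(errors)
    n_large = sum(1 for e in errors if e > large_threshold_ms)
    return mean_error, n_large


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity decreased, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


# Hypothetical boundary times in milliseconds (not data from the paper).
mean_err, n_large = boundary_error_stats([100.0, 250.0, 400.0], [105.0, 245.0, 440.0])

# Relative reductions implied by the average errors reported in the abstract.
japanese_reduction = relative_reduction(6.86, 5.79)
chinese_reduction = relative_reduction(8.67, 6.61)
```

Under these assumptions, the average errors reported in the abstract correspond to relative reductions of roughly 16% on Japanese data and 24% on Chinese data.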

Semi-continuous Segmental Probability Modeling for Continuous Speech Recognition

State Duration-Based Segmental Probability Model

State Duration-Based Segmental Probability Model for Chinese Speech

A New Model for Speech Recognition: Center-Distance Continuous Probability Model

Center-distance Continuous Probability Models and the Distance Measure

Speech Verification Based On Simplified Segmental Probability Model

Probabilistic Speaker-Class Based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition

Discriminative Dynamic Gaussian Mixture Selection with Enhanced Robustness and Performance for Multi-Accent Speech Recognition

Modified SPM for speech recognition

Distributed Submodular Maximization for Large Vocabulary Continuous Speech Recognition

Discriminative Training and Explicit Duration Modeling for HMM-based Automatic Segmentation

Utterance verification using modified segmental probability model

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

SPGM: Prioritizing Local Features for enhanced speech separation performance

Multi-Stream Posterior Features and Combining Subspace GMMs for Low Resource LVCSR

Improved context-dependent acoustic modeling for continuous Chinese speech recognition

A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information

Continuous Speech Tokenizer in Text To Speech

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

Context Dependent Syllable Acoustic Model For Continuous Chinese Speech Recognition

Tone Recognition of Chinese Continuous Speech