Abstract: HMM-based automatic segmentation is widely used for corpus construction in concatenative speech synthesis. Because the main sources of inaccuracy in HMM-based automatic segmentation are the HMM training criterion and duration control, we study these two issues. For HMM training, we apply discriminative training and introduce a new criterion, named Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and parameter optimization is performed by the Generalized Probabilistic Descent (GPD) algorithm. For duration control, we apply explicit duration models and, to limit the computational cost, propose a two-step segmentation method in which the duration model is incorporated in a postprocessing step. The experimental results show that both techniques significantly improve segmentation accuracy, but with different focuses: MSGE-based discriminative training improves the accuracy of sensitive boundaries, i.e., boundaries where a segmentation error is likely to cause a noticeable degradation in speech synthesis quality, while explicit duration modeling eliminates large errors. After combining the two techniques, the average error was reduced from 6.86 ms to 5.79 ms on Japanese data, and from 8.67 ms to 6.61 ms on Chinese data. At the same time, the number of errors larger than 30 ms was reduced by 25% and 51% on Japanese and Chinese data, respectively.
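The two figures reported above, the average boundary error and the number of errors larger than 30 ms, can be illustrated with a minimal sketch. It assumes that automatic and reference boundaries are given as aligned lists of times in milliseconds; the function name `boundary_error_stats` and the parameter `large_ms` are hypothetical and do not come from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Tuple


def boundary_error_stats(
    auto_ms: Sequence[float],
    ref_ms: Sequence[float],
    large_ms: float = 30.0,  # assumed threshold, matching the 30 ms figure quoted above
) -> Tuple[float, int]:
    """Return (mean absolute boundary error in ms, count of errors above large_ms)."""
    if len(auto_ms) != len(ref_ms) or len(ref_ms) == 0:
        raise ValueError("boundary lists must be non-empty and of equal length")
    # Absolute deviation of each automatic boundary from its reference boundary.
    errors = [abs(a - r) for a, r in zip(auto_ms, ref_ms)]
    mean_error = sum(errors) / len(errors)
    # Large errors are those strictly greater than the threshold.
    n_large = sum(1 for e in errors if e > large_ms)
    return mean_error, n_large


if __name__ == "__main__":
    # Toy boundaries (not data from the paper).
    mean_error, n_large = boundary_error_stats(
        [100.0, 205.0, 340.0], [102.0, 200.0, 300.0]
    )
    print(f"mean error: {mean_error:.2f} ms, errors > 30 ms: {n_large}")
```

Reporting the two numbers separately matters here because, as the abstract notes, the two techniques target different parts of the error distribution: one refines boundaries that are already close, the other removes outliers.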

Generating And Evaluating Segmentations For Automatic Speech Recognition Of Conversational Telephone Speech

A Practical Way to Improve Automatic Phonetic Segmentation Performance

Discriminative Training and Explicit Duration Modeling for HMM-based Automatic Segmentation

Extracting Supra-Segment Information for Text-Independent Speaker Verification

Multi-speaker Segmentation and Clustering of Telephone Speech

Automatic Speech Segmentation Combining an HMM-Based Approach and Recurrence Trend Analysis

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Telephone Conversation Speaker Recogniton System Based on Speech Purify

Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation

An Effective Real-Time Audio Segmentation Method Based on Time-Frequency Energy Analysis

Robust Phonetic Segmentation Using Spectral Transition measure for Non-Standard Recording Environments

Speaker Segmentation and Clustering in Meetings

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Automatic Segmentation for TTS Units

Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation

Assessing Segmental Impact for Objective Speech Quality Evaluation.

Automatic Phonetic Segmentation Using HMM Model

Appropriate data segmentation improves speech encoding models

Advances in speaker segmentation and clustering

Mixture Encoder for Joint Speech Separation and Recognition

Semi-continuous Segmental Probability Modeling for Continuous Speech Recognition.