Abstract: To produce high-quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of the speech data is performed by aligning the speech signals with the corresponding sequence of Hidden Markov Models (HMMs). In the second step, segment boundaries are refined with a proposed Context-Dependent Boundary Model (CDBM). A Classification and Regression Tree (CART) is adopted to organize the available data into a structured hierarchical tree, in which acoustically similar boundaries are clustered together to train tied CDBMs for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Measured against a manual segmentation reference, segmentation accuracy (within a tolerance of 20 ms) is improved by the CDBMs from 78.1% (baseline) to 94.8% in Mandarin Chinese and from 81.4% to 92.7% in English, with about 1,000 manually segmented sentences used to train the models. To further reduce the amount of manual data needed to train CDBMs for a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10 to 20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.
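
The accuracy figures above are stated as the share of boundaries that fall within a 20 ms tolerance of the manual reference. The following is a minimal sketch of how such a metric could be computed. It is not the authors' evaluation code: the function name `boundary_accuracy`, the one-to-one pairing of automatic and manual boundaries, and the inclusive comparison (a deviation of exactly 20 ms counts as correct) are all assumptions made for illustration.

```python
from typing import Sequence


def boundary_accuracy(
    auto_ms: Sequence[float],
    manual_ms: Sequence[float],
    tolerance_ms: float = 20.0,
) -> float:
    """Fraction of automatic boundaries within tolerance_ms of the manual ones.

    Assumption: both sequences list the same boundaries in the same order,
    with positions given in milliseconds.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not manual_ms:
        raise ValueError("at least one boundary is required")

    # Count boundaries whose absolute deviation is within the tolerance
    # (inclusive comparison is an assumption, not stated in the abstract).
    hits = sum(
        1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tolerance_ms
    )
    return hits / len(manual_ms)


if __name__ == "__main__":
    # Hypothetical boundary positions, not data from the paper.
    manual = [100, 250, 400, 610]
    automatic = [105, 275, 395, 630]
    print(boundary_accuracy(automatic, manual))
```

In this made-up example the deviations are 5, 25, 5 and 20 ms, so three of the four boundaries count as correct at the default tolerance.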