Abstract: To produce high-quality synthesis, a concatenation-based Text-to-Speech (TTS) system usually requires a large number of segmental units to cover various acoustic-phonetic contexts. However, careful manual labeling and segmentation by human experts, which is still the most reliable way to prepare such units, is labor-intensive. In this paper we adopt a two-step procedure to automate the labeling, segmentation and refinement process. In the first step, coarse segmentation of the speech data is performed by aligning the speech signals with the corresponding sequence of Hidden Markov Models (HMMs). In the second step, the segment boundaries are refined with a proposed Context-Dependent Boundary Model (CDBM). A Classification and Regression Tree (CART) is adopted to organize the available data into a structured hierarchical tree, in which acoustically similar boundaries are clustered together to train tied CDBMs for boundary refinement. Optimal CDBM parameters and training conditions are found through a series of experimental studies. Compared with a manual segmentation reference, the CDBMs improve segmentation accuracy (within a tolerance of 20 ms) from 78.1% (baseline) to 94.8% in Mandarin Chinese and from 81.4% to 92.7% in English, with about 1,000 manually segmented sentences used to train the models. To further reduce the amount of manual data needed to train CDBMs for a new speaker, we adapt a well-trained CDBM via efficient adaptation algorithms. With only 10-20 manually segmented sentences as adaptation data, the adapted CDBM achieves a segmentation accuracy of 90%.
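The accuracy figures above count a boundary as correct when it falls within 20 ms of the manual reference. A minimal sketch of such a metric follows; it is not the authors' evaluation code. The function name `boundary_accuracy`, the one-to-one pairing of automatic and reference boundaries, and the inclusive comparison (`<=`) are assumptions made for illustration.

```python
from typing import Sequence


def boundary_accuracy(
    auto_ms: Sequence[float],
    ref_ms: Sequence[float],
    tolerance_ms: float = 20.0,
) -> float:
    """Fraction of automatic boundaries within tolerance_ms of the reference.

    Assumes (hypothetically) that both sequences list the same boundaries
    in the same order, with times in milliseconds.
    """
    if len(auto_ms) != len(ref_ms):
        raise ValueError("boundary lists must have the same length")
    if not ref_ms:
        raise ValueError("at least one boundary is required")
    # A boundary is a hit when its absolute deviation is within the tolerance.
    hits = sum(1 for a, r in zip(auto_ms, ref_ms) if abs(a - r) <= tolerance_ms)
    return hits / len(ref_ms)


# Made-up boundary times; deviations are 5, 30, 0 and 10 ms.
example = boundary_accuracy([100, 250, 400, 610], [105, 220, 400, 600])
```

In this made-up example three of the four boundaries deviate by 20 ms or less, so the result is 0.75.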

Prosodic Word Boundaries Prediction for Mandarin Text-to-Speech

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Improving Prosodic Boundaries Prediction For Mandarin Speech Synthesis By Using Enhanced Embedding Feature And Model Fusion Approach

Prosodic boundary prediction based on maximum entropy model with error-driven modification

Adversarial Multi-Task Learning for Mandarin Prosodic Boundary Prediction with Multi-Modal Embeddings

Automatic Phrase Boundary Labeling for Mandarin TTS Corpus Using Context-Dependent HMM

The Pause Duration Prediction for Mandarin Text-to-speech System

Statistical Model Based on Probability Frequency for Mandarin Prosodic Structure Prediction

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Automatic Phrase Boundary Labeling for a Mandarin TTS Corpus Using the Viterbi Decoding Algorithm

BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End

A Character-level Span-based Model for Mandarin Prosodic Structure Prediction

Pitch Prediction for Mandarin TTS with Mutual Prosodic Constraint

Unsupervised Prosodic Phrase Boundary Labeling of Mandarin Speech Synthesis Database Using Context-Dependent HMM

Investigating Effect of Rich Syntactic Features on Mandarin Prosodic Boundaries Prediction

Context-Dependent Boundary Model for Refining Boundaries Segmentation of TTS Units

Prosody Model for Mandarin Text-to-Speech System

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Prosodic word prediction using a maximum entropy approach

Predicting Chinese Prosodic Word Based on Transformation-Based Error-Driven Learning

Improving Prosody with Linguistic and BERT Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS