Abstract: This paper presents a novel spectral modeling method for statistical parametric speech synthesis based on a hidden trajectory model (HTM). An HTM is a structured generative model implemented in two stages. First, hidden formant trajectories are generated from time-aligned formant target sequences by a bidirectional filter. This target-filtering model provides a correlation structure across temporal frames and efficiently describes the effect of co-articulation on speech signals. Second, the observed cepstral features are modeled as the sum of a formant-related component and a residual component. The formant-related component is predicted from the hidden formant trajectories using a nonlinear analytical function, and the prediction residuals are modeled by context-dependent Gaussians. We apply HTM-based acoustic modeling to speech synthesis and investigate its effectiveness in improving the naturalness and controllability of synthetic speech. Experimental results show that the proposed method improves the accuracy of spectral feature prediction and the naturalness of synthetic speech compared with the conventional HMM-based method, especially when the amount of training data is limited. Furthermore, the method gives effective control over the vowel quality and formant sharpness of synthetic speech through manipulation of the distribution parameters of the phone-dependent targets for formant frequencies and bandwidths. (C) 2015 Elsevier B.V. All rights reserved.
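The first stage described in the abstract (filtering time-aligned formant targets into smooth hidden trajectories) can be illustrated with a minimal sketch. The abstract does not give the filter's form, so the code below assumes a symmetric, exponentially decaying FIR kernel normalized to unit gain, which is one form commonly used for target filtering. The function name `bidirectional_filter`, the parameters `gamma` and `half_width`, and the formant target values are all hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

def bidirectional_filter(targets, gamma=0.6, half_width=7):
    """Smooth a (frames, dims) target sequence with a symmetric kernel.

    Assumption: the kernel is gamma**|tau| for tau in [-half_width, half_width],
    normalized to sum to 1. The abstract does not specify the filter.
    """
    targets = np.asarray(targets, dtype=float)
    taps = np.arange(-half_width, half_width + 1)
    kernel = gamma ** np.abs(taps)
    kernel /= kernel.sum()  # unit gain: a constant target is left unchanged

    # Repeat the edge frames so the ends are not pulled toward zero.
    padded = np.pad(targets, ((half_width, half_width), (0, 0)), mode="edge")

    out = np.empty_like(targets)
    for d in range(targets.shape[1]):
        # "valid" mode returns exactly one output per original frame.
        out[:, d] = np.convolve(padded[:, d], kernel, mode="valid")
    return out

# Two hypothetical phones, 20 frames each, with made-up F1/F2 targets in Hz.
targets = np.vstack([
    np.tile([700.0, 1200.0], (20, 1)),
    np.tile([300.0, 2300.0], (20, 1)),
])
trajectory = bidirectional_filter(targets)
```

Because the kernel looks both backward and forward in time, each frame near the phone boundary is influenced by both the preceding and the following target, which is how this kind of filter captures co-articulation as a smooth transition rather than a step.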

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis

Learning Virtual HD Model for Bi-model Emotional Speaker Recognition

Full HMM Training for Minimizing Generation Error in Synthesis

Minimum Generation Error Training for HMM-Based Speech Synthesis

Cross Validation and Minimum Generation Error for Improved Model Clustering in HMM-based TTS

Statistical modeling of syllable-level F0 features for HMM-based unit selection speech synthesis

Cross-Validation and Minimum Generation Error Based Decision Tree Pruning for HMM-based Speech Synthesis

Minimum Generation Error Training for HMM-based Prediction of Articulatory Movements

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

Modeling DCT Parameterized F0 Trajectory at Intonation Phrase Level with DNN or Decision Tree

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks

Hierarchical English Emphatic Speech Synthesis Based on HMM with Limited Training Data

Model Adaptation for HMM-Based Speech Synthesis under Minimum Generation Error Criterion

Statistical Parametric Speech Synthesis Using a Hidden Trajectory Model

Minimum generation error training with weighted Euclidean distance on LSP for HMM-based speech synthesis

Generating Emphasis from Neutral Speech Using Hierarchical Perturbation Model by Decision Tree and Support Vector Machine

GMM-HMM Acoustic Model Training by a Two Level Procedure with Gaussian Components Determined by Automatic Model Selection

A Full Training Framework of Cross-Stream Dependence Modelling for HMM-based Singing Voice Synthesis

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM