Abstract: This paper presents a novel spectral modeling method for statistical parametric speech synthesis using a hidden trajectory model (HTM). An HTM is a structured generative model with a two-stage implementation. First, hidden formant trajectories are generated from time-aligned formant target sequences by a bidirectional filter. This target-filtering model provides a correlation structure across temporal frames and efficiently describes the effect of co-articulation on speech signals. Second, the observed cepstral features are composed of a formant-related component and a residual component. The formant-related component is predicted from the hidden formant trajectories using a nonlinear analytical function, and the prediction residuals are modeled by context-dependent Gaussians. We apply HTM-based acoustic modeling to speech synthesis and investigate how effectively it improves the naturalness and controllability of synthetic speech. Experimental results show that the proposed method improves the accuracy of spectral feature prediction and the naturalness of synthetic speech compared with the conventional HMM-based method, especially when the amount of training data is limited. Furthermore, the method provides effective control over the vowel quality and formant sharpness of synthetic speech through manipulation of the distribution parameters for the phone-dependent targets of formant frequencies and bandwidths. © 2015 Elsevier B.V. All rights reserved.
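The first stage described in the abstract, target filtering, can be illustrated with a minimal sketch. The abstract does not give the filter's form, so the symmetric exponentially decaying FIR kernel used here is an assumption, as are the function name `bidirectional_filter`, the decay factor `gamma`, the window half-width `half_width`, and the example target values; none of them are taken from the paper.

```python
import numpy as np

def bidirectional_filter(targets, gamma=0.6, half_width=7):
    """Smooth a frame-level target sequence with a symmetric FIR kernel.

    targets:    array of shape (num_frames,) or (num_frames, num_dims)
    gamma:      decay factor in (0, 1); hypothetical value, not from the paper
    half_width: number of frames the kernel extends on each side (assumed)
    """
    targets = np.asarray(targets, dtype=float)
    offsets = np.arange(-half_width, half_width + 1)
    kernel = gamma ** np.abs(offsets)   # same weight for past and future frames
    kernel /= kernel.sum()              # unit gain: a constant target is preserved

    # Repeat the edge frames so the trajectory stays on target at the boundaries.
    pad = [(half_width, half_width)] + [(0, 0)] * (targets.ndim - 1)
    padded = np.pad(targets, pad, mode="edge")

    if targets.ndim == 1:
        return np.convolve(padded, kernel, mode="valid")
    columns = [np.convolve(padded[:, d], kernel, mode="valid")
               for d in range(targets.shape[1])]
    return np.stack(columns, axis=1)

# Illustrative piecewise-constant first-formant targets (Hz) for two phones,
# ten frames each. The values are made up for the example.
targets = np.concatenate([np.full(10, 700.0), np.full(10, 300.0)])
trajectory = bidirectional_filter(targets)
```

Because the kernel looks both backward and forward in time, the trajectory starts moving toward the second phone's target before the phone boundary and only settles on it afterward, which is one simple way to picture how such a model can capture co-articulation.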

Formant Speech Synthesis Based on Trainable Model

Statistical Parametric Speech Synthesis Using a Hidden Trajectory Model

A hidden trajectory model with bi-directional target filtering: cascaded vs. integrated implementation for phonetic recognition

Formant-Controlled HMM-Based Speech Synthesis

Amplitude Spectrum Based Excitation Model for HMM-Based Speech Synthesis

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

Target-filtering model based articulatory movement prediction for articulatory control of HMM-based speech synthesis

HMM based speech synthesis with Global Variance Training method

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training

Cross-Stream Dependency Modeling for HMM-Based Speech Synthesis

Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement

Modeling DCT Parameterized F0 Trajectory at Intonation Phrase Level with DNN or Decision Tree

Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis

An Unified and Automatic Approach of Mandarin HTS System

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

PHMM Based Asynchronous Acoustic Model for Chinese Large Vocabulary Continuous Speech Recognition