Abstract: This paper investigates F0 modeling of speech in deep neural networks (DNNs) for statistical parametric speech synthesis (SPSS). Recently, DNNs have been applied to acoustic modeling in SPSS and have shown good performance in characterizing complex dependencies between contextual features and acoustic observations. However, the additive nature and long-term suprasegmental properties of F0 features have not been fully exploited in existing DNN-based SPSS. Two model structures, a cascade DNN and a parallel DNN, are proposed to embody the hierarchical and additive properties of F0 in DNN-based prosody modeling. In the cascade structure, the DNN-predicted F0 contours of higher levels are used as input to the DNN of the current level. In the parallel structure, F0 components corresponding to different prosody levels are generated separately by DNNs and added together to obtain the final F0 contour. An optimized discrete cosine transform (DCT) is used to extract long-term F0 features at the syllable, word, and phrase levels. The experimental results show that our approach yields better subjective performance than either the conventional HMM or DNN approaches. Among all compared systems, the parallel DNN achieves the best objective and subjective performance. (C) 2015 Elsevier B.V. All rights reserved.
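
The additive, multi-level view of F0 described in the abstract can be illustrated with a small sketch. It is not the paper's method: it uses a plain truncated DCT-II rather than the optimized DCT, it involves no DNNs, and the function names (`dct_basis`, `smooth_segment`, `decompose`), the boundary lists, and the choice of three coefficients per unit are all assumptions made for illustration. Each level keeps a low-order DCT fit of what the longer units left unexplained, so the components sum back to the original contour.

```python
import numpy as np

def dct_basis(n_frames, n_coefs):
    """Orthonormal DCT-II basis with shape (n_coefs, n_frames)."""
    n = np.arange(n_frames)
    k = np.arange(n_coefs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / n_frames) * np.sqrt(2.0 / n_frames)
    basis[0] /= np.sqrt(2.0)  # scale the constant row so all rows have unit norm
    return basis

def smooth_segment(segment, n_coefs):
    """Project one unit's contour onto its first n_coefs DCT basis vectors."""
    n_coefs = min(n_coefs, len(segment))
    basis = dct_basis(len(segment), n_coefs)
    coefs = basis @ segment   # long-term features of this unit
    return basis.T @ coefs    # smooth contour rebuilt from those features

def decompose(f0, levels, n_coefs=3):
    """Split a contour into additive components, longest units first.

    levels: one list of frame boundaries per prosodic level (hypothetical
    format), e.g. phrase boundaries, then word, then syllable boundaries.
    """
    residual = np.asarray(f0, dtype=float).copy()
    components = []
    for bounds in levels:
        comp = np.zeros_like(residual)
        for start, end in zip(bounds[:-1], bounds[1:]):
            comp[start:end] = smooth_segment(residual[start:end], n_coefs)
        components.append(comp)
        residual = residual - comp  # shorter units model what remains
    return components, residual

# Synthetic 60-frame contour: a slow decline plus a faster ripple.
frames = np.arange(60)
f0 = 5.0 - 0.005 * frames + 0.05 * np.sin(2 * np.pi * frames / 10)
levels = [
    [0, 60],                          # one phrase
    [0, 30, 60],                      # two words
    [0, 10, 20, 30, 40, 50, 60],      # six syllables
]
components, residual = decompose(f0, levels)
reconstructed = sum(components) + residual
```

In a parallel structure of the kind the abstract describes, each component would be predicted by its own model and the predictions summed; here the components are simply computed from a known contour to show that the additive decomposition is lossless.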

Investigation of Prosodic F0 Layers in Hierarchical F0 Modeling for HMM-based Speech Synthesis

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks

Modeling Pitch Trajectory by Hierarchical HMM with Minimum Generation Error Training

Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

Statistical modeling of syllable-level F0 features for HMM-based unit selection speech synthesis

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Parsing Hierarchical Prosodic Structure For Mandarin Speech Synthesis

Prosodic Modeling with Rich Syntactic Context in HMM-based Mandarin Speech Synthesis

Clustering and Feature Learning Based F0 Prediction for Chinese Speech Synthesis

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

A Maximum Entropy Based Hierarchical Model for Automatic Prosodic Boundary Labeling in Mandarin

Cross-Stream Dependency Modeling for HMM-Based Speech Synthesis

Cross-stream Dependency Modeling Using Continuous F0 Model for HMM-based Speech Synthesis

Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Modeling DCT Parameterized F0 Trajectory at Intonation Phrase Level with DNN or Decision Tree

Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information

Learning Prosodic Patterns for Mandarin Speech Synthesis

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

F0 Prediction Model of Speech Synthesis Based on Template and Statistical Method