Abstract: This paper investigates F0 modeling of speech in deep neural networks (DNNs) for statistical parametric speech synthesis (SPSS). Recently, DNNs have been applied to acoustic modeling in SPSS and have shown good performance in characterizing complex dependencies between contextual features and acoustic observations. However, the additive nature and long-term suprasegmental properties of F0 features have not been fully exploited in existing DNN-based SPSS. Two model structures, a cascade DNN and a parallel DNN, are proposed to embody the hierarchical and additive properties of F0 in DNN-based prosody modeling. In the cascade structure, the DNN-predicted F0 contours of higher levels are used as input to the DNN of the current level. In the parallel structure, F0 components corresponding to different prosody levels are generated separately by DNNs and added together to obtain the final F0 contour. An optimized discrete cosine transform (DCT) is used to extract long-term F0 features at the syllable, word, and phrase levels. The experimental results show that our approach yields better subjective performance than both the conventional HMM and DNN approaches. Among all compared systems, the parallel DNN achieves the best objective and subjective performance. © 2015 Elsevier B.V. All rights reserved.
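
The additive view of F0 described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes a plain orthonormal DCT-II in place of the optimized DCT, uses only two levels (phrase and syllable), works on a synthetic log-F0 contour, and takes the components directly from the signal rather than from DNN predictions. All function names, coefficient counts, and syllable boundaries are hypothetical.

```python
import numpy as np

def dct_basis(n_frames, n_coefs):
    """Orthonormal DCT-II basis with shape (n_coefs, n_frames)."""
    n = np.arange(n_frames)
    k = np.arange(n_coefs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / n_frames) * np.sqrt(2.0 / n_frames)
    basis[0] /= np.sqrt(2.0)  # scale the DC row so the basis is orthonormal
    return basis

def dct_features(segment, n_coefs):
    """Describe a contour segment by its first n_coefs DCT coefficients."""
    n_coefs = min(n_coefs, len(segment))
    return dct_basis(len(segment), n_coefs) @ segment

def dct_reconstruct(coefs, n_frames):
    """Rebuild a smooth contour of n_frames frames from truncated coefficients."""
    return dct_basis(n_frames, len(coefs)).T @ coefs

def decompose(log_f0, syllable_bounds, phrase_coefs=3, syllable_coefs=4):
    """Split a contour into a phrase-level trend and syllable-level detail."""
    # Phrase level: a few DCT coefficients over the whole contour.
    phrase = dct_reconstruct(dct_features(log_f0, phrase_coefs), len(log_f0))
    # Syllable level: model what the phrase component leaves unexplained.
    residual = log_f0 - phrase
    syllable = np.zeros_like(log_f0)
    for start, end in syllable_bounds:
        seg = residual[start:end]
        syllable[start:end] = dct_reconstruct(
            dct_features(seg, syllable_coefs), len(seg)
        )
    return phrase, syllable

# Synthetic log-F0 contour: a declining trend plus faster local movement.
t = np.linspace(0.0, 1.0, 60)
log_f0 = 5.0 - 0.3 * t + 0.05 * np.sin(2 * np.pi * 6 * t)
bounds = [(0, 20), (20, 40), (40, 60)]  # assumed syllable boundaries (frames)

phrase, syllable = decompose(log_f0, bounds)
f0_sum = phrase + syllable  # additive superposition of the two levels

rmse_phrase = float(np.sqrt(np.mean((phrase - log_f0) ** 2)))
rmse_sum = float(np.sqrt(np.mean((f0_sum - log_f0) ** 2)))
print(f"phrase only: {rmse_phrase:.4f}, phrase + syllable: {rmse_sum:.4f}")
```

In a parallel structure of the kind the abstract describes, each component would be predicted by its own DNN from contextual features and then summed as in `f0_sum`; here the components come from the contour itself, purely to show the superposition.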

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

F0 Transformation for Emotional Speech Synthesis Using Target Approximation Features and Bidirectional Associative Memories

Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Statistical modeling of syllable-level F0 features for HMM-based unit selection speech synthesis

Multi-Layer F0 Modeling for HMM-Based Speech Synthesis

A Hierarchical F0 Modeling Method for HMM-based Speech Synthesis

Auditive Learning Based Chinese F0 Prediction

A Novel Hybrid Approach for Mandarin Speech Synthesis

Robust F0 Modeling for Mandarin Speech Recognition in Noise

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

A Hierarchical Viterbi Algorithm For Mandarin Hybrid Speech Synthesis System

Duration optimization of speaker adaptation in Mandarin TTS

Cross-Stream Dependency Modeling for HMM-Based Speech Synthesis

Improving Prosodic Boundaries Prediction For Mandarin Speech Synthesis By Using Enhanced Embedding Feature And Model Fusion Approach

Cross-stream Dependency Modeling Using Continuous F0 Model for HMM-based Speech Synthesis

Clustering and Feature Learning Based F0 Prediction for Chinese Speech Synthesis

Formant-Controlled HMM-Based Speech Synthesis

Mandarin Speech Synthesis Based on Pitch Synchronous Time-Frequency Interpolation

A Novel Hybrid Mandarin Speech Synthesis System Using Different Base Units for Model Training and Concatenation