Abstract: The modeling of fundamental frequency (F0) in HMM-based speech synthesis is critical to producing speech that is natural and accurately conveys the nuances of the message. However, F0 modeling is difficult because F0 values are normally considered to depend on a binary voicing decision: they are continuous in voiced regions and undefined in unvoiced regions. F0 is therefore a discontinuous function of time. The multi-space probability distribution HMM (MSDHMM) is a widely used solution to this problem. It essentially models a joint distribution of discrete voicing labels and the discontinuous F0 observations. However, because of the discontinuity assumption, the MSDHMM provides a rather weak F0 trajectory model. In this paper, F0 is instead viewed as a continuous function of time, which is achieved by assuming that F0 can be observed within unvoiced regions as well as voiced regions. This provides a continuous F0 data stream that can be modeled by standard HMMs. Voicing labels are modeled either implicitly or explicitly in order to perform voicing classification, and a globally tied distribution (GTD) technique is used to achieve robust F0 estimation. Both objective measures and subjective listening tests demonstrate that continuous F0 modeling yields better synthesized F0 trajectories and significantly improves the naturalness of synthesized speech compared with the MSDHMM.
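
To make the idea of a continuous F0 stream concrete, the sketch below fills unvoiced frames by linear interpolation between neighboring voiced frames and keeps the voicing decision as a separate binary label. The abstract does not say how F0 values in unvoiced regions are obtained, so the interpolation, the function name `make_continuous_f0`, and the convention that unvoiced frames are marked with 0 are all assumptions made for illustration only.

```python
import numpy as np

def make_continuous_f0(f0, unvoiced_value=0.0):
    """Return (continuous_f0, voicing_labels) from a frame-level F0 track.

    Assumption: frames equal to `unvoiced_value` are unvoiced. This is a
    common convention in F0 extractors, not something the abstract states.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 != unvoiced_value
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    frames = np.arange(len(f0))
    # Linear interpolation across unvoiced gaps; np.interp holds the
    # first/last voiced value constant at the utterance edges.
    continuous = np.interp(frames, frames[voiced], f0[voiced])
    return continuous, voiced.astype(int)

# Hypothetical six-frame track with unvoiced frames marked as 0 Hz.
continuous_f0, voicing = make_continuous_f0([0, 100, 0, 0, 130, 0])
```

The resulting `continuous_f0` has a value in every frame, so it could be fed to a standard HMM stream, while `voicing` carries the labels that would be modeled implicitly or explicitly for voicing classification.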
