Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. Conventional methods adopt high-level spectral parameters, such as mel-cepstra or line spectral pairs, as the features for hidden Markov model (HMM)-based parametric speech synthesis. The proposed method improves on the conventional approach in two ways. First, distributions of low-level, untransformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion under the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM generate better spectral envelope parameter sequences than the conventional Gaussian-HMM, with superior generalization capabilities, and that DBN-HMM and RBM-HMM perform similarly, possibly due to the use of the Gaussian approximation. As a result, the proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
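
The closed-form parameter generation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each frame's distribution has already been reduced to a Gaussian with diagonal covariance (which is what a Gaussian approximation provides), that the state alignment is already known, and that the feature is a single scalar per frame with one delta window. The function names `build_window_matrix` and `generate_trajectory` and the central-difference delta window are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def build_window_matrix(T):
    """Stack static (identity) and delta rows into a (2T, T) matrix W.

    Assumption: the delta feature is the central difference
    0.5 * (c[t+1] - c[t-1]), with indices clamped at the sequence edges.
    """
    static = np.eye(T)
    delta = np.zeros((T, T))
    for t in range(T):
        delta[t, min(t + 1, T - 1)] += 0.5
        delta[t, max(t - 1, 0)] -= 0.5
    return np.vstack([static, delta])


def generate_trajectory(mean, var):
    """Return the static trajectory c maximizing the Gaussian likelihood of W c.

    mean, var: arrays of shape (2T,), ordered [static..., delta...],
    holding the per-frame Gaussian means and (diagonal) variances.
    Solves the normal equations (W^T P W) c = W^T P mean, with P = diag(1/var).
    """
    T = mean.shape[0] // 2
    W = build_window_matrix(T)
    P = np.diag(1.0 / var)
    A = W.T @ P @ W
    b = W.T @ P @ mean
    return np.linalg.solve(A, b)


# Usage: a step in the static means is smoothed by the zero-mean delta constraint.
static_mean = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
delta_mean = np.zeros(6)
mean = np.concatenate([static_mean, delta_mean])
var = np.concatenate([np.ones(6), 0.1 * np.ones(6)])
trajectory = generate_trajectory(mean, var)
print(trajectory)
```

Because the objective is quadratic in the static trajectory once every frame is Gaussian, the solution reduces to one linear solve. This is the reason a Gaussian approximation makes the generation problem closed-form.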

Incorporating AM-FM Effect in Voiced Speech for Probabilistic Acoustic Tube Model

Improvement of Probabilistic Acoustic Tube model for speech decomposition

Use of Particle Filtering and MCMC for Inference in Probabilistic Acoustic Tube Model

Probabilistic Acoustic Tube: a Probabilistic Generative Model of Speech for Speech Analysis/synthesis

Behavioral Modeling of RF Power amplifiers Using Modified Volterra Series

Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system

A New Combined Model of Statics-Dynamics of Speech

Probabilistic Speaker-Class Based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Integrating Articulatory Features into HMM-Based Parametric Speech Synthesis

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis

PAN: Phoneme-Aware Network for Monaural Speech Enhancement

Improvement of hidden Markov model (HMM) for speech recognition

A combined model of statics-dynamics of speech optimized using maximum mutual information

A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

A New Acoustic Modeling of Inter-Syllable Context-Dependent Units for Putonghua Continuous Speech Recognition

Acoustic statistical modeling based new generation speech synthesis technology

Acoustic Statistical Modeling Based Speech Synthesis Technologies

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method

Partial-tied-mixture Auxiliary Chain Models for Speech Recognition Based on Dynamic Bayesian Networks