Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. Conventional hidden Markov model (HMM)-based parametric speech synthesis uses high-level spectral parameters, such as mel-cepstra or line spectral pairs, as its features. The proposed method improves on this conventional approach in two ways. First, distributions of low-level, untransformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion under the constraints of the dynamic features. To obtain a closed-form solution to the parameter generation problem, a Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state. Our experimental results show that both RBM-HMM and DBN-HMM generate better spectral envelope parameter sequences than the conventional Gaussian-HMM and generalize better, and that DBN-HMM and RBM-HMM perform similarly, possibly because of the Gaussian approximation. As a result, the proposed method significantly alleviates the over-smoothing effect and improves the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
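
The closed-form generation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-dimensional static feature, a single delta window of (-0.5, 0, 0.5) with replicated edge frames, and per-frame Gaussian means and variances that are simply given (in the paper these would come from the Gaussian approximation of the RBM or DBN marginal at each HMM state). The function names `build_window_matrix` and `generate_trajectory` are hypothetical.

```python
import numpy as np

def build_window_matrix(T, delta_win=(-0.5, 0.0, 0.5)):
    """Return W of shape (2T, T) so that o = W @ c interleaves
    static and delta features: [c_0, dc_0, c_1, dc_1, ...]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0  # static row
        for k, w in enumerate(delta_win):
            # Assumed boundary handling: replicate the edge frames.
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w  # delta row
    return W

def generate_trajectory(mean, var):
    """Maximize the Gaussian output probability of o = W @ c over c.

    mean, var: arrays of shape (T, 2) holding the [static, delta]
    mean and variance per frame (diagonal covariance assumed).
    Solves (W^T P W) c = W^T P mu, where P is the diagonal precision.
    """
    T = mean.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / var.reshape(-1)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mean.reshape(-1))
    return np.linalg.solve(A, b)

# Usage: a step in the static means, with zero-mean delta constraints.
static_mean = np.array([0.0, 0.0, 1.0, 1.0])
mean = np.stack([static_mean, np.zeros(4)], axis=1)
var = np.stack([np.ones(4), np.full(4, 0.1)], axis=1)
trajectory = generate_trajectory(mean, var)
```

Because the objective is quadratic in the static sequence once every state distribution is Gaussian, one linear solve yields the trajectory. This is the reason a Gaussian approximation makes the generation problem tractable in closed form.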

EXTRACTING STRUCTURAL SPECTRAL FEATURES USING WHAT-WHERE AUTO-ENCODERS FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Forensic Speech Enhancement Based on Two-Dimensional Fractional Fourier Transform Domain

Extracting Spectral Features Using Deep Autoencoders with Binary Distributed Hidden Units for Statistical Parametric Speech Synthesis

A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis

Deep Belief Network-Based Post-Filtering For Statistical Parametric Speech Synthesis

Spectral Modeling Using Neural Autoregressive Distribution Estimators for Statistical Parametric Speech Synthesis

DBN-based Spectral Feature Representation for Statistical Parametric Speech Synthesis

Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform

Extracting and Predicting Word-Level Style Variations for Speech Synthesis

Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification

Learning Deep and Wide Contextual Representations Using BERT for Statistical Parametric Speech Synthesis

Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Integrating Discrete Word-Level Style Variations into Non-Autoregressive Acoustic Models for Speech Synthesis

Speech Enhancement Based On Analysis Synthesis Framework With Improved Pitch Estimation And Spectral Envelope Enhancement

Auditory model-based speech feature extraction and its application to speaker identification

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement

Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis

Speech Enhancement Based on Analysis–Synthesis Framework with Improved Parameter Domain Enhancement