Abstract: This paper presents a new spectral modeling method for statistical parametric speech synthesis. Conventional hidden Markov model (HMM)-based parametric speech synthesis uses high-level spectral parameters, such as mel-cepstra or line spectral pairs, as its features. The proposed method improves on this approach in two ways. First, distributions of low-level, untransformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of a single Gaussian distribution, we adopt graphical models with multiple hidden variables, namely restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion under the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to obtain a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM generate better spectral envelope parameter sequences than the conventional Gaussian-HMM and generalize better, and that DBN-HMM and RBM-HMM perform similarly, possibly because of the Gaussian approximation. As a result, the proposed method significantly alleviates the over-smoothing effect and improves the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
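The closed-form parameter generation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each frame has already been reduced to Gaussian statistics (a mean and a variance for the static feature and for its delta), which is the role the Gaussian approximation plays. The one-dimensional feature, the diagonal variances, the delta window 0.5 * (c[t+1] - c[t-1]), and the function names build_window_matrix, solve and generate_trajectory are all assumptions made for illustration, not details taken from the paper. Under these assumptions the most probable static trajectory c solves the linear system (W' P W) c = W' P mu, where W maps the static trajectory to stacked static and delta features and P is the diagonal precision matrix.

```python
def build_window_matrix(T):
    """Return the 2T x T matrix W mapping a static trajectory to [static, delta] per frame.

    Row 2t selects c[t]; row 2t+1 computes the assumed delta 0.5 * (c[t+1] - c[t-1]),
    with out-of-range neighbours treated as absent.
    """
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        if t > 0:
            W[2 * t + 1][t - 1] = -0.5
        if t < T - 1:
            W[2 * t + 1][t + 1] = 0.5
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def generate_trajectory(means, variances):
    """Return the static trajectory maximising the Gaussian output probability.

    means and variances are interleaved per frame: [static_0, delta_0, static_1, ...].
    """
    T = len(means) // 2
    W = build_window_matrix(T)
    prec = [1.0 / v for v in variances]  # diagonal precision matrix P
    rows = range(2 * T)
    # Normal equations: (W' P W) c = W' P mu
    A = [[sum(W[r][i] * prec[r] * W[r][j] for r in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[r][i] * prec[r] * means[r] for r in rows) for i in range(T)]
    return solve(A, b)
```

When the delta variances are very large, the dynamic constraints carry almost no weight and the generated trajectory falls back to the static means; smaller delta variances pull neighbouring frames toward a smoother path.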

Speech Bandwidth Extension Based on GMM and Clustering Method

Universal background model reduction based efficient speaker recognition

Restoring High Frequency Spectral Envelopes Using Neural Networks For Speech Bandwidth Extension

Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control

Speech Bandwidth Extension Using Bottleneck Features and Deep Recurrent Neural Networks

A Novel Unified Framework for Speech Enhancement and Bandwidth Extension Based on Jointly Trained Neural Networks

Vector Quantized Diffusion Model Based Speech Bandwidth Extension

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

Speaker Segmentation and Clustering Based on the Improved Spectral Clustering

Bispectral feature speech intelligibility assessment metric based on auditory model

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis

DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks

Subband Energy Distance Measure Applied in Multi-Pass Speech/Non-Speech Discrimination

A Novel Research to Artificial Bandwidth Extension Based on Deep BLSTM Recurrent Neural Networks and Exemplar-Based Sparse Representation

Speech Separation Using Independent Vector Analysis with an Amplitude Variable Gaussian Mixture Model

Noise Robust Speaker Recognition Based on Adaptive Frame Weighting in GMM for i-Vector Extraction

Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method

GMM Based Low-Complexity Adaptive Machine-Learning Equalizers for Optical Fiber Communication

BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural network for speech super-resolution

A hybrid GMM and codebook mapping method for spectral conversion