Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Xiang Yin, Zhen-Hua Ling, Ya-Jun Hu, Li-Rong Dai
DOI: https://doi.org/10.1109/icassp.2016.7472654
2016-01-01
Abstract: This paper proposes a spectral modeling method using a deep conditional restricted Boltzmann machine (DCRBM) for statistical parametric speech synthesis. In this method, a DCRBM, which combines a deep neural network (DNN) with a conditional restricted Boltzmann machine (CRBM), is used to describe the conditional distribution of spectral envelopes given linguistic features. Compared with a DNN or a deep mixture density network (DMDN), a DCRBM is better at describing the multimodal distribution of high-dimensional acoustic features with cross-dimension correlations. At the training stage, the DNN part and the CRBM part of the DCRBM are pre-trained successively, and then all model parameters are fine-tuned jointly. At synthesis time, given the linguistic features of the input text, spectral envelopes are generated from the estimated DCRBM by iterative sampling and dynamic-feature-constrained parameter generation. Experimental results show that the proposed method produces more natural-sounding speech than the hidden Markov model (HMM)-based, DNN-based, and DMDN-based synthesis methods. The method also outperforms previous work that uses restricted Boltzmann machines (RBMs) to model the distributions of spectral envelopes at HMM states.
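To make the "iterative sampling" step more concrete, the following is a minimal sketch of how a spectral frame could be drawn from a conditional RBM. It is not the paper's formulation: it assumes a unit-variance Gaussian-Bernoulli CRBM whose visible and hidden biases are shifted linearly by a conditioning vector `c` (standing in for the DNN output), and every name in it (`crbm_generate`, `W`, `A`, `B`, `a`, `b`, `n_steps`) is hypothetical. The dynamic-feature-constrained parameter generation mentioned in the abstract is not covered.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_generate(c, W, A, B, a, b, n_steps=20, rng=None):
    """Alternating Gibbs-style updates in an assumed Gaussian-Bernoulli CRBM.

    c : (n_cond,)        conditioning vector (assumed to come from a DNN)
    W : (n_vis, n_hid)   visible-hidden weights
    A : (n_vis, n_cond)  conditioning -> visible bias weights
    B : (n_hid, n_cond)  conditioning -> hidden bias weights
    a : (n_vis,), b : (n_hid,)  static biases
    """
    rng = np.random.default_rng(0) if rng is None else rng
    a_c = a + A @ c          # visible bias shifted by the condition
    b_c = b + B @ c          # hidden bias shifted by the condition
    v = a_c.copy()           # start from the conditional bias
    for _ in range(n_steps):
        # p(h_j = 1 | v, c) for binary hidden units
        p_h = sigmoid(W.T @ v + b_c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # use the mean of p(v | h, c) = N(a_c + W h, I) instead of a noisy sample
        v = a_c + W @ h
    return v

# Toy usage with random parameters (illustrative sizes only)
rng = np.random.default_rng(1)
n_vis, n_hid, n_cond = 8, 4, 6
frame = crbm_generate(
    c=rng.standard_normal(n_cond),
    W=0.1 * rng.standard_normal((n_vis, n_hid)),
    A=rng.standard_normal((n_vis, n_cond)),
    B=rng.standard_normal((n_hid, n_cond)),
    a=np.zeros(n_vis),
    b=np.zeros(n_hid),
)
```

Because the hidden units are sampled rather than averaged, different hidden configurations can pull the visible vector toward different modes, which is the kind of multimodality a single Gaussian output layer cannot represent.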