Deep Belief Network-Based Post-Filtering For Statistical Parametric Speech Synthesis

Ya-Jun Hu, Zhen-Hua Ling, Li-Rong Dai
DOI: https://doi.org/10.1109/icassp.2016.7472731
2016-01-01
Abstract: The speech synthesized by statistical parametric speech synthesis (SPSS) always sounds muffled. One important reason is that the generated spectral envelopes are over-smoothed and many detailed spectral structures in natural speech are lost. This paper presents a deep belief network (DBN)-based post-filtering method for hidden Markov model (HMM)-based SPSS to address this issue. At training time, a DBN is estimated using the spectral envelopes extracted from natural speech. This DBN serves as a generatively trained post-filter which processes the spectral envelopes recovered from the predicted spectral features at synthesis time. Experimental results show that the effectiveness of this method depends on the sampling strategy used to generate the training data of the restricted Boltzmann machines (RBMs) that form the higher layers of the DBN. When binary samples are adopted instead of the mean-field approximation, the DBN post-filter can alleviate the over-smoothing effect of parameter generation and significantly improve the naturalness of synthetic speech when either mel-cepstra or line spectral pairs (LSP) are used as spectral features. Its performance is comparable to that of the parameter generation method with global variance (GV) modeling for mel-cepstra and better than that of the LSP-based formant enhancement method used in previous work.
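The two sampling strategies contrasted in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `hidden_layer_output`, the sigmoid hidden units, and the toy dimensions are all assumptions made for illustration. Given the weights of an already trained RBM, the training data for the next RBM in the stack are either the hidden activation probabilities (mean-field approximation) or binary samples drawn from those probabilities.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hidden_layer_output(v, W, b, strategy="binary", rng=None):
    """Produce training data for the next RBM from visible data v.

    Hypothetical helper (not from the paper): W and b are the weights
    and hidden biases of an already trained RBM with sigmoid hidden units.
    """
    p = sigmoid(v @ W + b)  # P(h_j = 1 | v) for every hidden unit
    if strategy == "mean_field":
        # Pass the real-valued activation probabilities upward.
        return p
    if strategy == "binary":
        # Draw one Bernoulli sample per hidden unit.
        rng = np.random.default_rng() if rng is None else rng
        return (rng.random(p.shape) < p).astype(p.dtype)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy data: 4 frames of a 6-dimensional input, 3 hidden units (assumed sizes).
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 6))
W = 0.1 * rng.standard_normal((6, 3))
b = np.zeros(3)

h_mean = hidden_layer_output(v, W, b, strategy="mean_field")
h_bin = hidden_layer_output(v, W, b, strategy="binary", rng=rng)
```

Here `h_mean` holds values strictly between 0 and 1, while `h_bin` holds only 0s and 1s. Per the abstract, it is the binary variant that lets the DBN post-filter alleviate over-smoothing.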