A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

Ling-Hui Chen, Tuomo Raitio, Cassia Valentini-Botinhao, Zhen-Hua Ling, Junichi Yamagishi
DOI: https://doi.org/10.1109/taslp.2015.2461448
2015-01-01
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Abstract: The generated speech of hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds "muffled." One cause of this degradation in speech quality may be the loss of fine spectral structures. In this paper, we propose to use a deep generative architecture, a generatively trained deep neural network (DNN), as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech, to compensate for the gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilters: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilters that are currently used in speech synthesis: a fixed mel-cepstral-based postfilter, the global variance-based parameter generation, and the modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with conventional methods.
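As a rough illustration of the cascade described in the abstract (one RBM on the synthetic side, a BAM linking the two hidden layers, and one RBM on the natural side), the following is a minimal sketch of a deterministic forward pass. It is not the authors' implementation: the class name `PostfilterSketch`, the layer sizes, the unit-variance Gaussian visible units, and the mean-field (non-sampled) pass are all assumptions, and the random weights merely stand in for parameters that the paper obtains through generative training.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for the hidden-unit activations."""
    return 1.0 / (1.0 + np.exp(-x))


class PostfilterSketch:
    """Hypothetical RBM -> BAM -> RBM cascade mapping synthetic to natural spectra."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Random placeholders; in the paper these come from generative training.
        self.W_syn = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # synthetic-side RBM
        self.b_syn = np.zeros(n_hidden)
        self.W_bam = rng.normal(0.0, 0.01, (n_hidden, n_hidden))   # BAM between hidden layers
        self.b_bam = np.zeros(n_hidden)
        self.W_nat = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # natural-side RBM
        self.a_nat = np.zeros(n_visible)

    def __call__(self, x_syn):
        # Encode the synthetic spectrum with the synthetic-side RBM.
        h_syn = sigmoid(x_syn @ self.W_syn + self.b_syn)
        # Map synthetic hidden units to natural hidden units through the BAM.
        h_nat = sigmoid(h_syn @ self.W_bam + self.b_bam)
        # Decode with the natural-side RBM (mean of assumed unit-variance Gaussian visibles).
        return h_nat @ self.W_nat.T + self.a_nat


# Usage: four frames of a 40-dimensional feature vector (dimensions are assumed).
frames = np.random.default_rng(1).normal(size=(4, 40))
postfilter = PostfilterSketch(n_visible=40, n_hidden=64)
enhanced = postfilter(frames)
```

With trained weights in place of the placeholders, the same pass would return one enhanced frame per input frame, in the same feature domain as the input.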