Extracting Spectral Features Using Deep Autoencoders with Binary Distributed Hidden Units for Statistical Parametric Speech Synthesis.

Ya-Jun Hu, Zhen-Hua Ling
DOI: https://doi.org/10.1109/taslp.2018.2791804
2018-01-01
Abstract: This paper presents a spectral feature extraction method using deep autoencoders (DAEs) with binary distributed hidden units (BDAEs) for statistical parametric speech synthesis (SPSS). Conventional DAEs are trained to minimize the error of reconstructing raw features. In this paper, we investigate another important property of DAEs that may influence their performance as feature extractors for regression tasks, i.e., the degree of binarization of hidden units. Our analysis shows that making the hidden units of DAEs binary may help alleviate the over-smoothing effect caused by acoustic modeling and parameter generation, which is one of the main deficiencies of current SPSS systems. This paper further proposes an effective BDAE training method that adds noise to the input of hidden units during model training and applies DBN-based pretraining strategies. Our experiments adopt feedforward deep neural networks as acoustic models for SPSS and compare the performance of different spectral feature extractors. Experimental results show that when low-dimensional spectral features are extracted by BDAEs, the predicted spectral features can reconstruct spectral envelopes closer to natural samples than when using conventional DAEs. Subjective evaluations on the synthetic voices of a Chinese speaker and an English speaker demonstrate that BDAEs achieve better naturalness of synthetic speech than conventional mel-cepstra and other neural-network-based feature extractors, such as DAEs and DBNs.
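The noise-injection idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid code layer, the Gaussian noise, and the names `encode`, `noise_std`, and `binarization_gap` are all assumptions made for illustration, and the abstract does not specify the noise distribution or how the degree of binarization is measured. The sketch only shows where the noise enters (the input of the hidden units, before the nonlinearity) and one simple way to quantify how close activations are to 0 or 1.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def encode(x, W, b, noise_std=0.0, rng=None):
    """Code-layer activations of a hypothetical single-layer encoder.

    During training, Gaussian noise (an assumed choice) is added to the
    pre-activation, i.e. the input of the hidden units. Units whose
    pre-activations sit far from zero are barely affected by the noise,
    so training is nudged toward saturated, near-binary codes.
    """
    pre = x @ W + b
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        pre = pre + rng.normal(0.0, noise_std, size=pre.shape)
    return sigmoid(pre)


def binarization_gap(h):
    """Mean distance of activations from the nearest of {0, 1}.

    An illustrative measure only: 0.0 means fully binary, 0.5 means
    every unit sits exactly at 0.5.
    """
    return float(np.mean(np.minimum(h, 1.0 - h)))
```

At synthesis time one would call `encode` with `noise_std=0.0`, so the extracted features are deterministic; the noise is used only while the model is being trained.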