Denoising-and-Dereverberation Hierarchical Neural Vocoder for Statistical Parametric Speech Synthesis

Yang Ai, Zhen-Hua Ling, Wei-Lu Wu, Ang Li
DOI: https://doi.org/10.1109/TASLP.2022.3182268
2022-01-01
Abstract: This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) that converts noisy and reverberant acoustic features into clean speech waveforms. The DNR-HiNet vocoder is built by modifying the amplitude spectrum predictor (ASP) of the original HiNet vocoder. The modified denoising and dereverberation ASP (DNR-ASP) predicts clean log amplitude spectra from degraded input acoustic features. To achieve this, the DNR-ASP first predicts the log amplitude spectra of the noisy and reverberant speech and the log amplitude spectra of the additive noise and the room impulse response (RIR), and then performs initial denoising and dereverberation using signal processing algorithms. The initially processed log amplitude spectra are then enhanced by another neural network to obtain the final clean log amplitude spectra. We also introduce a bandwidth extension model and a frequency resolution extension model into the DNR-ASP to further improve its performance. Finally, a statistical parametric speech synthesis (SPSS) method with DNR-HiNet is proposed for cases where the target speaker's recordings are degraded by noise and reverberation. Experimental results indicate that the DNR-HiNet vocoder generated denoised and dereverberated waveforms from noisy and reverberant acoustic features and outperformed the original HiNet vocoder and a few other neural vocoders. On speech enhancement tasks, its performance was competitive with several advanced speech enhancement methods. Furthermore, the SPSS method with DNR-HiNet achieved better synthetic speech quality than the conventional approach of directly applying speech enhancement to the degraded adaptation data.
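The abstract does not say which signal processing algorithms perform the initial denoising and dereverberation. As an illustration only, the sketch below assumes power spectral subtraction for the noise and a per-frame log-domain subtraction for the RIR, and it assumes the RIR prediction is available as a log amplitude spectrum. The function name `initial_denoise_dereverb` and the `floor` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def initial_denoise_dereverb(log_amp_degraded, log_amp_noise, log_amp_rir, floor=1e-5):
    """Rough initial clean-up of log amplitude spectra (illustrative sketch).

    Assumed degradation model per time-frequency bin (not taken from the paper):
        |Y|^2 ~= (|X| * |H|)^2 + |N|^2
    where Y is degraded speech, X clean speech, H the RIR and N additive noise.
    All inputs are natural-log amplitude spectra that broadcast to (frames, bins).
    """
    amp_degraded = np.exp(log_amp_degraded)
    amp_noise = np.exp(log_amp_noise)

    # Assumed denoising step: power spectral subtraction, floored so the
    # logarithm below stays finite when the noise estimate is too large.
    power_reverberant = np.maximum(amp_degraded ** 2 - amp_noise ** 2, floor ** 2)
    log_amp_reverberant = 0.5 * np.log(power_reverberant)

    # Assumed dereverberation step: dividing by |H| in the amplitude domain
    # is a subtraction in the log amplitude domain.
    return log_amp_reverberant - log_amp_rir


# Toy check: build degraded spectra from known components and recover them.
clean_amp = np.full((3, 4), 2.0)
rir_amp = np.full((1, 4), 0.5)
noise_amp = np.full((3, 4), 0.6)
degraded_amp = np.sqrt((clean_amp * rir_amp) ** 2 + noise_amp ** 2)

estimate = initial_denoise_dereverb(
    np.log(degraded_amp), np.log(noise_amp), np.log(rir_amp)
)
recovered_amp = np.exp(estimate)
```

In this toy case the recovered amplitudes match the clean ones because the data were generated from the same assumed model. Real reverberation extends across frames, so a per-frame subtraction is only a coarse approximation, which is consistent with the abstract's use of a further neural network to enhance the initially processed spectra.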