Multi-resolution Convolutional Residual Neural Networks for Monaural Speech Dereverberation

Lei Zhao, Wenbo Zhu, Shengqiang Li, Hong Luo, Xiao-Lei Zhang, Susanto Rahardja
DOI: https://doi.org/10.1109/taslp.2024.3385270
2024-01-01
Abstract: It is known that reverberant speech varies across acoustic environments according to reverberation time. However, most deep-learning-based speech dereverberation methods rely on a single deep model to learn the context information, which may bias the model toward only part of the range of reverberation times. In this paper, we propose a multi-resolution framework to address this issue. The framework integrates the dereverberation ability of multiple deep subnetworks with different time resolutions into a unified model by transferring dereverberation information from high-resolution subnetworks to low-resolution subnetworks. As a result, the unified model can perform well at both long and short reverberation times. We further propose two implementations of the framework based on advanced convolutional residual neural networks. The first implementation, named multi-resolution UNet, uses our new implementation of UNet based on convolutional blocks as the dereverberation subnetwork. The second implementation, named multi-resolution stacked convolutional blocks, uses our new stacked convolutional blocks as the subnetwork. Experimental results in both simulated and real-world environments show that the proposed algorithms outperform state-of-the-art dereverberation methods in terms of both the evaluation metrics for speech dereverberation and word error rate (WER) for speech recognition.
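The idea of passing information from high-resolution to low-resolution subnetworks can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: "time resolution" is modelled as average pooling of spectrogram frames, the subnetwork is a hypothetical residual linear layer rather than a UNet or stacked convolutional blocks, and the names `avg_pool_time`, `upsample_time`, `toy_subnetwork`, and `multi_resolution_forward` are invented.

```python
import numpy as np

def avg_pool_time(x, factor):
    """Average every `factor` consecutive frames of a (frames, bins) array."""
    frames = (x.shape[0] // factor) * factor  # drop frames that do not fill a group
    return x[:frames].reshape(-1, factor, x.shape[1]).mean(axis=1)

def upsample_time(x, factor, frames):
    """Repeat each frame `factor` times, then crop or edge-pad to `frames` frames."""
    y = np.repeat(x, factor, axis=0)[:frames]
    if y.shape[0] < frames:
        pad = np.repeat(y[-1:], frames - y.shape[0], axis=0)
        y = np.concatenate([y, pad], axis=0)
    return y

def toy_subnetwork(x, weight):
    """Hypothetical stand-in for a subnetwork: a residual connection around tanh(x @ W)."""
    return x + np.tanh(x @ weight)

def multi_resolution_forward(spec, weights, factors=(1, 2, 4)):
    """Run subnetworks from high to low time resolution (assumed ordering).

    Each subnetwork sees the input plus the output carried over from the
    previous, higher-resolution subnetwork.
    """
    frames = spec.shape[0]
    carried = np.zeros_like(spec)
    for factor, weight in zip(factors, weights):
        pooled = avg_pool_time(spec + carried, factor)
        out = toy_subnetwork(pooled, weight)
        carried = upsample_time(out, factor, frames)
    return carried

# Usage with random data standing in for a reverberant magnitude spectrogram.
rng = np.random.default_rng(0)
spec = rng.standard_normal((50, 16))                       # 50 frames, 16 frequency bins
weights = [0.1 * rng.standard_normal((16, 16)) for _ in range(3)]
enhanced = multi_resolution_forward(spec, weights)
print(enhanced.shape)
```

The sketch only shows the data flow between resolutions. A real system would replace `toy_subnetwork` with trained convolutional residual networks and would likely use a learned mechanism to transfer information between them.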