Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Guochang Zhang, Libiao Yu, Chunliang Wang, Jianqiang Wei
DOI: https://doi.org/10.1109/icassp43922.2022.9746610
2022-05-23
Abstract: Speech quality is often degraded by acoustic echoes, background noise, and reverberation. In this paper, we propose a system that combines deep learning and signal processing to simultaneously suppress echoes, noise, and reverberation. For the deep learning, we design a novel speech dense-prediction backbone. For the signal processing, a linear acoustic echo canceller is used as conditional information for deep learning. To improve the performance of the speech dense-prediction backbone, strategies such as a microphone and reference phase encoder, multi-scale time-frequency processing, and streaming axial attention are designed. The proposed system ranked first in both the AEC and DNS Challenges (non-personal track) of ICASSP 2022. In addition, this backbone has also been extended to the multi-channel speech enhancement task, and placed second in the ICASSP 2022 L3DAS22 Challenge.