CASE-Net: Integrating local and non-local attention operations for speech enhancement

Xinmeng Xu, Weiping Tu, Yuhong Yang
DOI: https://doi.org/10.1016/j.specom.2023.02.006
IF: 2.723
2023-01-01
Speech Communication
Abstract: Local and non-local attention operations are two ubiquitous operations in speech enhancement (SE), and they are effective at generating more discriminative patterns from the noisy mixture. However, a noisy speech signal contains many fast-changing, dynamic acoustic features that are hard to capture precisely when both attention operations are applied indiscriminately. Moreover, simply combining local and non-local attention operations cannot avoid their demerits while keeping their merits in SE tasks. To tackle these issues, we propose a cooperative attention based SE network (CASE-Net) as an inventive attempt to make a trade-off between local and non-local attention operations for generating more discriminative patterns from local and global speech regions. In addition, to address the high computational cost of non-local attention, we propose a time–frequency (TF)-wise non-local attention model, in which the 2D non-local attention is divided into two 1D sub-attentions. The TF-wise non-local attention thus provides two parallel non-local sub-attentions that separately calculate the attention maps along the time and frequency axes; as a consequence, the training process is facilitated. Experimental results yield two observations: (1) cooperative attention makes an effective trade-off between local and non-local attention operations, and the proposed CASE-Net achieves higher performance than recent models in terms of PESQ and STOI; (2) the proposed TF-wise non-local attention significantly improves the network performance while maintaining a lower computational complexity than the conventional non-local attention.
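The idea of dividing 2D non-local attention into two parallel 1D sub-attentions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the layer details, so the single-head dot-product form, the projection matrices (`wq`, `wk`, `wv`), the residual sum used to merge the two branches, and all function names are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, wq, wk, wv):
    """Dot-product self-attention over axis 1 of x, shape (B, N, C).

    Each of the B sequences is attended independently, so the attention
    map has shape (B, N, N). Assumed form; the paper may differ.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                    # (B, N, D), (B, N, D), (B, N, C)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                     # rows sum to 1
    return attn @ v, attn

def tf_wise_nonlocal(x, params_t, params_f):
    """Two parallel 1D sub-attentions on a feature map x of shape (T, F, C).

    Merging the branches by a residual sum is an assumption of this sketch.
    """
    # Time branch: one length-T sequence per frequency bin.
    out_t, attn_t = axis_attention(x.transpose(1, 0, 2), *params_t)
    out_t = out_t.transpose(1, 0, 2)                    # back to (T, F, C)
    # Frequency branch: one length-F sequence per time frame.
    out_f, attn_f = axis_attention(x, *params_f)
    return x + out_t + out_f, attn_t, attn_f

def pairwise_score_counts(T, F):
    """Attention scores computed by full 2D attention vs. the TF-wise split."""
    full_2d = (T * F) ** 2          # every TF position attends to every other
    tf_wise = T * F * (T + F)       # F maps of T x T plus T maps of F x F
    return full_2d, tf_wise
```

Under these assumptions, a feature map with T = 6 frames and F = 4 frequency bins needs 576 pairwise scores for full 2D attention but 240 for the two 1D sub-attentions, which matches the abstract's claim of lower computational complexity in spirit, though not necessarily in its exact accounting.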