MAF-Net: Multidimensional Attention Fusion Network for Multichannel Speech Separation

Honglin Li, Qinghua Huang
DOI: https://doi.org/10.1007/s00530-023-01155-1
IF: 3.9
2023-01-01
Multimedia Systems
Abstract: Recent studies have shown that multichannel narrow-band speech separation achieves remarkable performance, while most successful deep learning-based studies operate directly on the full-band spectrum of speech. Motivated by these two different but complementary trends, this paper proposes a multidimensional attention fusion network (MAF-Net) that automatically exploits and fuses narrow-band and full-band speech separation information, with the aim of improving multichannel speech separation in reverberant environments. Specifically, it extracts effective narrow-band and full-band information along three dimensions (temporal, spatial, and channel) and dynamically integrates them through a feature fusion mechanism. First, the narrow-band feature extractor (NBFE) takes the multichannel mixed signals of a single frequency as input, collects temporal information frame by frame to model context dependency, and generates a narrow-band feature map. Then, the multidimensional attention fusion module (MAFM) adaptively fuses narrow-band and full-band information along the spatial and channel dimensions. Finally, two arrangements of the attention modules within the MAFM, the parallel MAFM (P-MAFM) and the sequential MAFM (S-MAFM), are designed to make full use of the spatial and channel attention features. Experimental results show that the proposed method outperforms other advanced methods.
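The difference between a parallel and a sequential arrangement of channel and spatial attention can be illustrated with a minimal sketch. This is not the authors' MAFM: the abstract does not give the layer definitions, so everything below is an assumption made for illustration. The feature maps are taken to have shape (channels, frequency, time), "spatial" is taken to mean positions on the frequency-time plane, the two inputs are fused by simple addition, and the attention weights come from parameter-free average pooling followed by a sigmoid rather than from learned layers. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x):
    """Per-channel gate of shape (C, 1, 1).

    Assumption: global average pooling over (frequency, time) followed by
    a sigmoid; a learned module would insert trainable layers here.
    """
    return sigmoid(x.mean(axis=(1, 2), keepdims=True))


def spatial_attention(x):
    """Per-position gate of shape (1, F, T).

    Assumption: average pooling over channels followed by a sigmoid.
    """
    return sigmoid(x.mean(axis=0, keepdims=True))


def fuse_parallel(narrow_band, full_band):
    """Parallel arrangement: both gates see the same fused input,
    and the two gated copies are summed."""
    x = narrow_band + full_band  # assumed fusion by addition
    return x * channel_attention(x) + x * spatial_attention(x)


def fuse_sequential(narrow_band, full_band):
    """Sequential arrangement: the channel gate is applied first, and the
    spatial gate is computed from the channel-gated result."""
    x = narrow_band + full_band  # assumed fusion by addition
    x = x * channel_attention(x)
    return x * spatial_attention(x)
```

Calling `fuse_parallel(nb, fb)` or `fuse_sequential(nb, fb)` on two arrays of identical shape `(C, F, T)` returns an array of the same shape. In the sequential variant the spatial gate depends on the output of the channel gate, whereas in the parallel variant the two gates are computed independently from the same input.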