Hybrid Attention Time-Frequency Analysis Network for Single-Channel Speech Enhancement.

Zehua Zhang, Xingwei Liang, Ruifeng Xu, Mingjiang Wang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10445944
2024-01-01
Abstract: The time-frequency domain remains central to speech signal analysis. Improving neural network-based speech models demands detailed multi-scale analysis of time-frequency features. This study presents the Hybrid Attention Time-Frequency Analysis Network (HATFANet), a model that uses a dual-branch structure to concurrently estimate the ideal ratio mask and the enhanced complex spectrum. Each branch incorporates Hybrid Attention Blocks (HABs), which use reshaping techniques and gated multi-layer perceptrons to capture local, global, and inter-window attention at different scales for more effective deep feature extraction. The addition of residual channel attention and a window multi-head self-attention mechanism accentuates channel attention features and intra-window attention. Our experiments verify the pivotal role of these HABs across varied attentional scales. HATFANet achieves state-of-the-art results on the Voice Bank + DEMAND dataset, recording 3.37 PESQ, 95.8% STOI, and 10.15 SSNR.
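For readers unfamiliar with the ideal ratio mask named in the abstract, the following is a minimal sketch of one common textbook definition, computed per time-frequency bin as sqrt(|S|^2 / (|S|^2 + |N|^2)) from clean-speech and noise magnitudes. It is not the HATFANet implementation: the function names, the `eps` stabilizer, and the toy magnitudes are assumptions made for illustration, and the abstract does not state which mask variant the paper uses.

```python
import math

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-12):
    """Return the IRM for 2-D lists of magnitudes indexed [frame][bin].

    Uses sqrt(|S|^2 / (|S|^2 + |N|^2)); eps (an assumption) avoids
    division by zero when a bin is silent in both signals.
    """
    return [
        [math.sqrt(s * s / (s * s + n * n + eps)) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_mag, noise_mag)
    ]

def apply_mask(mask, noisy_mag):
    """Scale each noisy magnitude bin by its mask value."""
    return [
        [m * y for m, y in zip(m_row, y_row)]
        for m_row, y_row in zip(mask, noisy_mag)
    ]

# Toy magnitudes (hypothetical values, 2 frames x 2 bins).
speech = [[3.0, 0.0], [1.0, 2.0]]
noise = [[4.0, 1.0], [1.0, 0.0]]
mask = ideal_ratio_mask(speech, noise)
```

Mask values lie in [0, 1]: bins dominated by speech stay close to 1, and bins dominated by noise are pushed toward 0. A network that estimates such a mask is trained to predict it from the noisy input alone, since clean speech and noise are not available separately at inference time.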