Multi-stage Strength Estimation Network with Cross Attention for Single Channel Speech Enhancement

Zipeng Zhang, Yuchen Ding, Wei Chen, Yutao Chen, Weiwei Guo, Houguang Liu
DOI: https://doi.org/10.1007/s11760-024-03364-1
IF: 1.583
2024-01-01
Signal Image and Video Processing
Abstract: Speech enhancement is a fundamental task in acoustic signal processing and remains an unsolved challenge. Recently, with the rapid development of deep learning, data-driven approaches built on a variety of machine learning modules have made great progress in speech enhancement. Each of these basic modules has unique advantages as well as certain limitations. Inspired by these modules' distinct preferences and the distinguishing features of speech signals, we propose a multi-stage strength estimation network with cross-attention for single-channel speech enhancement. The proposed method consists of a feature-wise fusion block using the attention mechanism and a strength estimation block using FFT and sequential representations (FTB). We first describe the speech enhancement problem mathematically, and then compare the proposed method with several well-known speech enhancement methods on the 50-h DNS and LibriFSD50K datasets, showing that it attends to both the time and frequency domains and achieves satisfactory results. Ablation studies are also carried out, and their results confirm the effectiveness of each component of the proposed method. By presenting this method, we show that the performance of speech enhancement models can be improved by combining modules with different properties, which points to a promising direction for future development.