CST-UNet: Cross Swin Transformer Enhanced U-Net with Masked Bottleneck for Single-Channel Speech Enhancement

Zipeng Zhang, Wei Chen, Weiwei Guo, Yiming Liu, Jianhua Yang, Houguang Liu
DOI: https://doi.org/10.1007/s00034-024-02736-9
IF: 2.311
2024-01-01
Circuits Systems and Signal Processing
Abstract: Speech enhancement performance has improved significantly with the introduction of deep learning models, especially methods based on the Long Short-Term Memory (LSTM) architecture. However, these methods suffer from high computational complexity and redundant input features. To address these issues, we propose a U-Net-based approach that uses an encoder/decoder to extract more concise features, thereby improving single-channel speech enhancement performance and reducing computational complexity. The proposed method combines a Cross-Swin-Transformer block with a masked bottleneck module. The network down-samples features while preserving detailed representations through skip connections and carefully designed blocks, and the bottleneck module extracts coarse representations of the hidden features to serve as masks. We evaluated our method against other U-Net-based approaches on the VCTK and DNS corpora using the CBAK, eSTOI, PESQ, STOI, and SI-SDR metrics. The results demonstrate that the proposed method achieves promising performance while significantly reducing computational complexity.
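Of the metrics listed in the abstract, SI-SDR (scale-invariant signal-to-distortion ratio) is simple enough to illustrate directly. The sketch below follows the commonly used definition of SI-SDR, not code from the paper; the function name `si_sdr`, the `eps` stabilizer, and the synthetic test signals are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (common definition; hypothetical helper)."""
    if len(estimate) != len(reference) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)

    # Remove the mean of each signal so DC offsets do not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    energy_r = sum(b * b for b in r)
    alpha = sum(a * b for a, b in zip(e, r)) / (energy_r + eps)

    # Everything in the estimate that is not the scaled target is distortion.
    target_energy = alpha * alpha * energy_r
    noise_energy = sum((a - alpha * b) ** 2 for a, b in zip(e, r))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Synthetic example (assumed signals, not data from the paper).
clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.1 * math.cos(0.37 * i) for i in range(1000)]
noisy = [c + v for c, v in zip(clean, noise)]
print(f"SI-SDR of noisy input: {si_sdr(noisy, clean):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, so an enhancement model cannot improve the metric simply by changing its output gain.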