Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Ding Zhao, Zhan Zhang, Bin Yu, Yuehai Wang
DOI: https://doi.org/10.1109/iccc54389.2021.9674447
2021-01-01
Abstract: As speech technology finds broader application, speech enhancement (SE) is becoming increasingly important. Most existing deep-learning-based SE methods use an encoder-decoder architecture, in which the encoder first encodes the noisy speech into a latent embedding. This latent embedding inevitably contains information about both speech and noise, which may degrade the decoder's prediction of clean speech. To address this, we propose a projection module based on self-attention (SA) that projects the noisy latent embedding into two orthogonal subspaces: a speech-dominant subspace and a noise-dominant subspace. The resulting speech latent embedding and noise latent embedding are then fed into two decoders that predict speech and noise, respectively. In addition, a gating module is applied to the skip connections to suppress leakage of irrelevant information. Finally, a merge module at the end of the model uses the predicted noise to refine the SE output. Experimental results on the benchmark dataset show that the proposed SE model outperforms most state-of-the-art models.
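The idea of splitting a latent embedding into two orthogonal parts can be illustrated with a small sketch. This is not the paper's projection module: the abstract does not give its equations, so the example below assumes a fixed, hand-picked set of spanning directions (`speech_directions`, hypothetical) where the paper would derive the subspace with self-attention. All function names and numeric values are illustrative assumptions. It shows only the property the abstract relies on: the two components are orthogonal and sum back to the original embedding.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def split_embedding(z, speech_basis):
    """Split one latent frame z into its projection onto span(speech_basis)
    and the orthogonal remainder (the stand-in for the noise part)."""
    speech = [0.0] * len(z)
    for b in speech_basis:
        c = dot(z, b)
        speech = [s + c * bi for s, bi in zip(speech, b)]
    noise = [zi - si for zi, si in zip(z, speech)]
    return speech, noise


# Hypothetical values: a 4-dimensional latent frame and two directions
# assumed to span the "speech-dominant" subspace.
speech_directions = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
basis = orthonormalize(speech_directions)
z = [0.5, -1.0, 2.0, 3.0]
speech, noise = split_embedding(z, basis)

print("speech part:", speech)
print("noise part: ", noise)
print("inner product:", dot(speech, noise))
```

In a model of the kind the abstract describes, each part would then go to its own decoder. Because the two parts sum to `z`, the split itself discards no information from the encoder output.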