Abstract: The redundant convolutional encoder-decoder network has proven useful in speech enhancement tasks. It captures the localized time-frequency details of speech signals through its fully convolutional structure and through the feature selection capability of the encoder-decoder mechanism. However, it does not explicitly address the extraction of informative features, which we regard as important for the representational capability of speech enhancement models. To solve this problem, we introduce the attention mechanism into the convolutional encoder-decoder model to explicitly emphasize useful information from three aspects, namely, channel, space, and concurrent space-and-channel. The attention operation is implemented with the squeeze-and-excitation mechanism and its variants. By assigning weights from these different perspectives according to global information, the model adaptively emphasizes valuable information and suppresses less useful information, thereby improving its representational capability. Experimental results show that the proposed attention mechanisms add only a small fraction of parameters while effectively improving the performance of convolutional neural network (CNN)-based models over their plain versions, and that they generalize well to unseen noises, signal-to-noise ratios (SNRs), and speakers. Among these mechanisms, the concurrent space-channel-wise attention yields the most significant improvement. Compared with state-of-the-art methods, the proposed mechanisms produce comparable or better results. We also integrate the proposed attention mechanisms into other CNN-based models and obtain performance gains. Moreover, we visualize the enhancement results to show the effect of the attention mechanisms more clearly.
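The three attention perspectives named in the abstract can be illustrated with a minimal NumPy sketch of squeeze-and-excitation style gating on a single feature map. This is not the authors' implementation: the tensor layout (channels, time, frequency), the reduction ratio `r`, the weight names `w1`, `w2`, `v`, and the choice to add the two branches in the concurrent variant are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_se(x, w1, w2):
    """Channel attention: one weight per channel, from global information.

    x  : feature map, shape (C, T, F)  (assumed layout)
    w1 : bottleneck weights, shape (C // r, C)  (hypothetical names)
    w2 : expansion weights, shape (C, C // r)
    """
    z = x.mean(axis=(1, 2))          # squeeze: global average per channel -> (C,)
    h = np.maximum(w1 @ z, 0.0)      # excitation, step 1: bottleneck + ReLU
    s = sigmoid(w2 @ h)              # excitation, step 2: weights in (0, 1) -> (C,)
    return x * s[:, None, None]      # rescale each channel

def spatial_se(x, v):
    """Spatial attention: one weight per time-frequency position.

    v : weights of a 1x1 convolution across channels, shape (C,)
    """
    q = sigmoid(np.tensordot(v, x, axes=(0, 0)))  # -> (T, F)
    return x * q[None, :, :]         # rescale each time-frequency bin

def concurrent_se(x, w1, w2, v):
    """Concurrent space-and-channel attention.

    Assumption: the two recalibrated maps are combined by addition.
    """
    return channel_se(x, w1, w2) + spatial_se(x, v)

# Usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
C, T, F, r = 8, 5, 6, 2
x = rng.standard_normal((C, T, F))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
v = rng.standard_normal(C)
y = concurrent_se(x, w1, w2, v)
print(y.shape)
```

The extra parameters per attention block in this sketch are only `w1`, `w2`, and `v`, which is consistent with the abstract's claim that the mechanisms add a small fraction of parameters relative to the convolutional layers they gate.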

Single-channel Speech Enhancement Using Multi-Task Learning and Attention Mechanism

Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

A Real-Time Speech Enhancement Algorithm Based on Convolutional Recurrent Network and Wiener Filter

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Incorporating Multi-Target in Multi-Stage Speech Enhancement Model for Better Generalization

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Single-channel speech enhancement using improved progressive deep neural network and masking-based harmonic regeneration

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Single-channel Speech Enhancement Student under Multi-channel Speech Enhancement Teacher

Multi-task Joint-Learning of Deep Neural Networks for Robust Speech Recognition

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Monaural Speech Enhancement Using Deep Multi-Branch Residual Network with 1-D Causal Dilated Convolutions

Combining Multi-Perspective Attention Mechanism With Convolutional Networks for Monaural Speech Enhancement

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

Shared Network for Speech Enhancement Based on Multi-Task Learning

Multi-task single channel speech enhancement using speech presence probability as a secondary task training target