Abstract: Speech enhancement is an important method for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech, and several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time–frequency (T–F) distribution of speech spectral components is also important for speech enhancement, but these studies usually ignore it. Multi-stage learning is an effective way to integrate several deep learning modules in one model, with the benefit that the optimization target can be updated iteratively, stage by stage. In this paper, we investigate speech enhancement with a multi-stage structure in which time–frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. A feature fusion block (FB) is inserted at the input of the later stages to reinject the original information, reducing the possibility that speech information lost in the early stages stays lost. The S-TCN blocks handle temporal sequence modelling. The TFA block is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterizing the salient T–F distribution of speech, using two parallel branches: time-frame attention and frequency attention. Extensive experiments demonstrate that the proposed model consistently outperforms existing baselines on two widely used objective metrics, PESQ and STOI. Our evaluation results also show that the TFA module significantly improves the system's robustness to noise.
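The two structural ideas named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the kernel size, the number of layers, the one-convolution-per-layer assumption, and the function names are all hypothetical, and the learned layers that a real TFA block would contain are replaced here by plain average pooling followed by a sigmoid. The first part computes the receptive field of a stack of dilated 1-D convolutions whose dilation rates double at each layer; the second part combines a per-frame weight and a per-frequency weight into a 2D attention map by an outer product.

```python
import math


def dilation_rates(num_layers, base=2):
    """Exponentially increasing dilation rates: 1, 2, 4, ... (base is assumed)."""
    return [base ** i for i in range(num_layers)]


def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked dilated 1-D convolutions.

    Assumes one convolution per layer; each layer adds (kernel_size - 1) * d frames.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tf_attention_map(spec):
    """Toy 2D attention map for a spectrogram given as T frames of F bins.

    Time branch: average over frequency, one weight per frame.
    Frequency branch: average over time, one weight per bin.
    The learned layers of a real attention block are omitted (assumption).
    """
    num_frames = len(spec)
    num_bins = len(spec[0])
    time_att = [sigmoid(sum(frame) / num_bins) for frame in spec]
    freq_att = [
        sigmoid(sum(spec[t][f] for t in range(num_frames)) / num_frames)
        for f in range(num_bins)
    ]
    # Outer product of the two 1-D attention vectors gives a T x F map.
    return [[ta * fa for fa in freq_att] for ta in time_att]


if __name__ == "__main__":
    rates = dilation_rates(4)
    print(rates, receptive_field(3, rates))
    att = tf_attention_map([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]])
    print(len(att), len(att[0]))
```

Under these assumptions, doubling the dilation at every layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is why such stacks are a common choice for modelling long-range dependencies.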

Speech Enhancement Model for High Sampling Rate Speech Datasets Based on Multi-branch Time Convolutional Network

Single Channel Speech Enhancement Based on Temporal Convolutional Network

Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Group Multi-Scale Convolutional Network for Monaural Speech Enhancement in Time-domain

Closing the Gap Between Time-Domain Multi-Channel Speech Enhancement on Real and Simulation Conditions

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

A Real-Time Speech Enhancement Algorithm Based on Convolutional Recurrent Network and Wiener Filter

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques

Multi-stage Strength Estimation Network with Cross Attention for Single Channel Speech Enhancement

A Time Domain Progressive Learning Approach with SNR Constriction for Single-Channel Speech Enhancement and Recognition

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

Time Domain Speech Enhancement Using SNR Prediction and Robust Speaker Classification

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Time domain speech enhancement with CNN and time-attention transformer

PhaseDCN: A Phase-Enhanced Dual-Path Dilated Convolutional Network for Single-Channel Speech Enhancement