Abstract: This paper proposes a single-channel supervised speech enhancement (SE) method based on UNET, a convolutional neural network (CNN) architecture that introduces a few modifications to the basic CNN architecture. In the training phase, the short-time Fourier transform (STFT) is applied to the noisy time-domain signal to obtain a noisy time-frequency representation, called the complex noisy matrix. The real and imaginary parts of the complex noisy matrix are concatenated to form the noisy concatenated matrix. UNET is applied to the noisy concatenated matrix to extract the speech components, and the CNN model is trained. In the testing phase, the same procedure is applied to the noisy time-domain signal to construct another noisy concatenated matrix, which is passed through the saved pre-trained model to produce an enhanced concatenated matrix. The real and imaginary parts are then separated from the enhanced concatenated matrix to form an enhanced complex matrix, from which the magnitude and phase are extracted. Using that magnitude and phase, the inverse STFT (ISTFT) generates the enhanced speech signal. The proposed method is evaluated on the IEEE databases with various types of noise, both stationary and non-stationary. Comparing the experimental results of the proposed algorithm with those of five other methods, namely STFT-sparse non-negative matrix factorization (SNMF), dual-tree complex wavelet transform (DTCWT)-SNMF, DTCWT-STFT-SNMF, STFT-convolutional denoising autoencoder (CDAE), and causal multi-head attention mechanism (CMAM), we find that the proposed algorithm generally improves speech quality and intelligibility at all considered signal-to-noise ratios (SNRs). It outperforms the five competing algorithms on every evaluation metric.
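The signal path described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the frame length, hop size, Hann window, the choice to concatenate along the frequency axis, and every function name (`stft`, `istft`, `to_concatenated`, `from_concatenated`, `enhance`) are assumptions, and the trained UNET is replaced by a placeholder callable.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT; returns a complex matrix of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window (assumed)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=128):
    """Weighted overlap-add inverse of stft() above."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = hop * (spec.shape[0] - 1) + n_fft
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Guard against division by zero where the window is exactly zero.
    return out / np.maximum(norm, 1e-12)

def to_concatenated(spec):
    """Stack real and imaginary parts side by side (axis choice is an assumption)."""
    return np.concatenate([spec.real, spec.imag], axis=1)

def from_concatenated(mat):
    """Split a concatenated matrix back into a complex matrix."""
    half = mat.shape[1] // 2
    return mat[:, :half] + 1j * mat[:, half:]

def enhance(noisy, model, n_fft=256, hop=128):
    """Testing-phase path: STFT -> concatenate -> model -> split -> ISTFT."""
    noisy_cat = to_concatenated(stft(noisy, n_fft, hop))
    enhanced_cat = model(noisy_cat)  # placeholder for the trained UNET
    enhanced_spec = from_concatenated(enhanced_cat)
    magnitude = np.abs(enhanced_spec)
    phase = np.angle(enhanced_spec)
    return istft(magnitude * np.exp(1j * phase), n_fft, hop)

# Usage with an identity "model": the output should match the input
# away from the signal edges, which checks the analysis/synthesis chain.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(4096)
restored = enhance(noisy, lambda m: m)
```

With a real model, the placeholder callable would map the noisy concatenated matrix to an enhanced one of the same shape; everything around it stays unchanged.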

Two-Stage UNet with Multi-Axis Gated Multilayer Perceptron for Monaural Noisy-Reverberant Speech Enhancement

Channel and Temporal-Frequency Attention UNet for Monaural Speech Enhancement

Multinoise-type Blind Denoising Using a Single Uniform Deep Convolutional Neural Network

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

Noisy-Reverberant Speech Enhancement Using DenseUNet with Time-Frequency Attention

A Two-Stage Deep Neural Network with Bounded Complex Ideal Ratio Masking for Monaural Speech Enhancement

Dual Branch Deep Interactive UNet for Monaural Noisy-Reverberant Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement

D2Net: A Denoising and Dereverberation Network Based on Two-branch Encoder and Dual-path Transformer

CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

Speech Enhancement Using U-Net with Compressed Sensing

Shared Network for Speech Enhancement Based on Multi-Task Learning

Supervised Single Channel Speech Enhancement Method Using UNET

Research on Speech Enhancement Algorithm Based on SA-Unet

Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Interactive Speech and Noise Modeling for Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

Multi-resolution Convolutional Residual Neural Networks for Monaural Speech Dereverberation