Abstract: In real environments, speech is easily corrupted by external interference, which results in the loss of important features. Deep learning has become a popular approach to speech enhancement because of its strong potential for solving nonlinear mapping problems over complex features. However, traditional deep learning methods are weak at learning important information from previous time steps and long-term dependencies in time-series data. To overcome this problem, we propose a novel speech enhancement method based on the fused features of deep neural networks (DNNs) and a gated recurrent unit (GRU) network. The proposed method uses the GRU to reduce the number of DNN parameters and to capture the context information of the speech, which improves the quality and intelligibility of the enhanced speech. First, a DNN with multiple hidden layers learns the mapping between the logarithmic power spectrum (LPS) features of noisy speech and those of clean speech. Second, the LPS features estimated by the DNN are fused with the noisy speech as the input of the GRU network to compensate for the missing context information. Finally, the GRU network learns the mapping between the fused features and the LPS features of clean speech. The proposed model is experimentally compared with traditional speech enhancement models, including DNN, CNN, LSTM and GRU. Experimental results demonstrate that, under matched noise conditions, the PESQ, SSNR and STOI of the proposed algorithm improve by 30.72%, 39.84% and 5.53%, respectively, compared with the noisy signal. Under unmatched noise conditions, the PESQ and STOI improve by 23.8% and 37.36%, respectively.
The advantage of the proposed method is that it uses the key information in the features to suppress noise in both matched and unmatched noise cases, and it outperforms other common speech enhancement methods.
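
The feature pipeline described in the abstract can be illustrated with a minimal Python sketch. It only shows how an LPS vector might be computed for one frame and how two feature vectors could be fused before entering the GRU. The function names `log_power_spectrum` and `fuse_features`, the naive DFT, the `eps` floor, and the choice of concatenation as the fusion rule are all assumptions made for illustration; the abstract does not specify them, and the DNN output here is a placeholder rather than a trained model.

```python
import cmath
import math


def log_power_spectrum(frame, eps=1e-10):
    """Log power spectrum of one real-valued frame, via a naive DFT.

    eps is an assumed small floor that avoids log(0) on empty bins.
    """
    n = len(frame)
    lps = []
    for k in range(n // 2 + 1):  # keep the non-negative frequency bins
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        lps.append(math.log(abs(coeff) ** 2 + eps))
    return lps


def fuse_features(dnn_lps, noisy_lps):
    """Fuse two LPS vectors by concatenation (an assumed fusion rule)."""
    if len(dnn_lps) != len(noisy_lps):
        raise ValueError("feature vectors must have the same number of bins")
    return list(dnn_lps) + list(noisy_lps)


# Toy 8-sample frame: one cosine cycle, so the energy sits in bin 1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
noisy_lps = log_power_spectrum(frame)

# Placeholder for the DNN's clean-LPS estimate (no trained network here).
dnn_lps = [value - 0.5 for value in noisy_lps]

# Fused vector that would be fed to the GRU, frame by frame.
gru_input = fuse_features(dnn_lps, noisy_lps)
```

With concatenation, the GRU input is twice as wide as a single LPS vector, so the recurrent network sees the DNN estimate and the original noisy evidence for the same frame together.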

Speech Enhancement with Multi-granularity Vector Quantization

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

The Method of Disentangled and Interpretable Representations for Speech Enhancement

The Study of Computer-Aided Speech Training Method for Deaf Children Based on Learning Vector Quantization

A Deep Representation Learning-based Speech Enhancement Method Using Complex Convolution Recurrent Variational Autoencoder

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Monaural Speech Enhancement Based on Spectrogram Decomposition for Convolutional Neural Network-sensitive Feature Extraction

An Improved Vector Quantization Method Using Deep Neural Network

Speech enhancement from fused features based on deep neural network and gated recurrent unit network

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

Shared Network for Speech Enhancement Based on Multi-Task Learning

Improving Speech Enhancement Via Event-based Query

Light-Weight VisualVoice: Neural Network Quantization on Audio Visual Speech Separation

Audio-Visual Speech Enhancement Based on Multiscale Features and Parallel Attention

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition