Abstract: In real-world conditions, speech is easily corrupted by environmental interference, which leads to the loss of important features. Deep learning has become a popular approach to speech enhancement because of its strong ability to learn nonlinear mappings between complex features. However, traditional deep learning methods are weak at learning important information from previous time steps and long-term dependencies in time-series data. To overcome this problem, we propose a novel speech enhancement method based on the fused features of deep neural networks (DNNs) and a gated recurrent unit (GRU) network. The proposed method uses the GRU to reduce the number of DNN parameters and to capture the context information of the speech, which improves the quality and intelligibility of the enhanced speech. First, a DNN with multiple hidden layers is used to learn the mapping between the logarithmic power spectrum (LPS) features of noisy speech and those of clean speech. Second, the LPS features produced by the DNN are fused with the noisy speech as the input of the GRU network to compensate for the missing context information. Finally, the GRU network learns the mapping between these fused features and the LPS features of clean speech. The proposed model is compared experimentally with traditional speech enhancement models, including DNN, CNN, LSTM and GRU. Experimental results show that, under matched noise conditions, the PESQ, SSNR and STOI of the proposed algorithm improve by 30.72%, 39.84% and 5.53%, respectively, compared with the noisy signal. Under unmatched noise conditions, the PESQ and STOI of the algorithm improve by 23.8% and 37.36%, respectively.
The advantage of the proposed method is that it uses the key information in the features to suppress noise under both matched and unmatched noise conditions, and it outperforms other common speech enhancement methods.
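The three stages described in the abstract (LPS extraction, fusion of the DNN estimate with the noisy features, and a recurrent step over the fused vector) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract does not state the frame length, the window, the fusion operator, or the layer sizes, so the sketch assumes a 16-sample Hann-windowed frame, fusion by simple concatenation, untrained random GRU weights, and a hypothetical `placeholder_dnn` that stands in for the trained DNN.

```python
import cmath
import math
import random


def log_power_spectrum(frame, eps=1e-10):
    """Log-power spectrum (LPS) of one real frame: Hann window, DFT, log|X|^2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    lps = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        lps.append(math.log(abs(coeff) ** 2 + eps))  # eps avoids log(0)
    return lps


def placeholder_dnn(noisy_lps):
    """Hypothetical stand-in for the trained DNN (identity mapping here)."""
    return list(noisy_lps)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def make_params(input_size, hidden_size, seed=0):
    """Random, untrained GRU weights; sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in "zrh":
        params["W" + gate] = mat(hidden_size, input_size)
        params["U" + gate] = mat(hidden_size, hidden_size)
        params["b" + gate] = [0.0] * hidden_size
    return params


def gru_step(x, h, p):
    """One GRU step with update gate z, reset gate r and candidate state."""
    def gate(name, state, act):
        pre = zip(matvec(p["W" + name], x), matvec(p["U" + name], state),
                  p["b" + name])
        return [act(a + b + c) for a, b, c in pre]

    z = gate("z", h, sigmoid)
    r = gate("r", h, sigmoid)
    cand = gate("h", [ri * hi for ri, hi in zip(r, h)], math.tanh)
    # Convention used here: new state = (1 - z) * old state + z * candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Toy input: a 16-sample sine frame standing in for one frame of noisy speech.
frame = [math.sin(2 * math.pi * 2 * i / 16) for i in range(16)]
noisy_lps = log_power_spectrum(frame)      # stage 1 input features
dnn_lps = placeholder_dnn(noisy_lps)       # stage 1 output (placeholder)
fused = dnn_lps + noisy_lps                # stage 2: fusion by concatenation (assumed)
params = make_params(len(fused), hidden_size=4)
h = gru_step(fused, [0.0] * 4, params)     # stage 3: one recurrent step
print(len(noisy_lps), len(fused), len(h))
```

With a 16-sample frame the LPS has 9 bins, so the fused vector has 18 entries and the hidden state has 4. In a real system the GRU would run over a sequence of fused frames and be followed by an output layer that predicts the clean-speech LPS; both are omitted here to keep the sketch short.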

GBNF-VAE: A Pathological Voice Enhancement Model Based on Gold Section for Bottleneck Feature With Variational Autoencoder

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

VSEGAN: Visual Speech Enhancement Generative Adversarial Network

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

A variance modeling framework based on variational autoencoders for speech enhancement

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization

Pathological voice adaptation with autoencoder-based voice conversion

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

A Speech Enhancement Method Based on Dual-Path Phase-Aware GAN Networks

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Improvement of Packet Loss Concealment for EVS Codec Based on Deep Learning

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

Speech enhancement from fused features based on deep neural network and gated recurrent unit network

Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems