Abstract: In this study, we propose a novel deep neural network (DNN) architecture for speech enhancement (SE) via a multiobjective learning and ensembling (MOLE) framework to achieve a compact and low-latency design while maintaining good performance in quality evaluations. MOLE follows the boosting concept of combining weak models into a strong classifier and consists of two compact DNNs. The first, called the multiobjective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multiobjective set that includes clean speech features, dynamic noise features, and the ideal ratio mask (IRM). The second, called the multiobjective ensembling DNN (MOE-DNN), takes the learned features from the MOL-DNN as inputs and separately predicts clean LPS and IRM, clean MFCC and IRM, and clean GFCC and IRM using three sets of weak regression functions. Finally, a postprocessing operation can be applied to the estimated clean features by leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech corrupted by 15 noise types not seen in model training, the SE results show that the MOLE approach, which features a small model size and low run-time latency, achieves consistent improvements over both DNN- and long short-term memory (LSTM)-based techniques in terms of all the objective metrics evaluated in this study for all three cases (input contexts of 1, 4, and 7 frames). The 1-frame MOLE-based SE system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame delay. It also achieves better performance than the LSTM-based SE system with a 4-frame, no-delay expansion that includes only the 3 previous frames, with 170 times less processing latency.
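
The two-stage data flow described in the abstract can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration only: the feature sizes in `DIMS`, the single hidden layer of width `HIDDEN`, the random weights, the choice to feed the MOL-DNN outputs to the MOE-DNN as the "learned features", and the averaging used as postprocessing. None of these details come from the abstract, and the names `build_mole` and `mole_forward` are hypothetical.

```python
import math
import random

# Hypothetical per-frame feature sizes; the abstract does not give them.
DIMS = {"lps": 8, "mfcc": 4, "gfcc": 4}
HIDDEN = 16  # assumed hidden-layer width


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, activation=None):
    """Apply one dense layer to vector x, with an optional nonlinearity."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    if activation == "relu":
        return [max(0.0, v) for v in y]
    if activation == "sigmoid":
        return [1.0 / (1.0 + math.exp(-v)) for v in y]
    return y


def build_mole(rng):
    """Build untrained toy parameters for the MOL-DNN and the MOE-DNN."""
    n_in = sum(DIMS.values())
    mol = {
        "hidden": make_layer(n_in, HIDDEN, rng),
        "clean": make_layer(HIDDEN, n_in, rng),         # clean LPS + MFCC + GFCC
        "noise": make_layer(HIDDEN, DIMS["lps"], rng),  # noise feature (assumed LPS-sized)
        "irm": make_layer(HIDDEN, DIMS["lps"], rng),    # one mask value per LPS bin (assumed)
    }
    n_learned = n_in + 2 * DIMS["lps"]  # size of the MOL-DNN output set
    moe = {"hidden": make_layer(n_learned, HIDDEN, rng)}
    for name, dim in DIMS.items():
        # Each branch predicts one clean feature type together with an IRM.
        moe[name] = make_layer(HIDDEN, dim, rng)
        moe[name + "_irm"] = make_layer(HIDDEN, DIMS["lps"], rng)
    return mol, moe


def mole_forward(mol, moe, noisy):
    """Run one noisy frame through MOL-DNN, MOE-DNN, and a toy postprocessing step."""
    # Stage 1 (MOL-DNN): multiple input features, multiple output targets.
    x = noisy["lps"] + noisy["mfcc"] + noisy["gfcc"]
    h = dense(mol["hidden"], x, "relu")
    clean = dense(mol["clean"], h)
    noise = dense(mol["noise"], h)
    irm = dense(mol["irm"], h, "sigmoid")  # mask values lie in (0, 1)

    # Stage 2 (MOE-DNN): three branches, each predicting (clean feature, IRM).
    learned = clean + noise + irm
    h2 = dense(moe["hidden"], learned, "relu")
    branches = {
        name: (dense(moe[name], h2), dense(moe[name + "_irm"], h2, "sigmoid"))
        for name in DIMS
    }

    # Assumed postprocessing: average the MOL-DNN and MOE-DNN clean LPS estimates.
    mol_lps = clean[:DIMS["lps"]]
    moe_lps = branches["lps"][0]
    enhanced_lps = [(a + b) / 2.0 for a, b in zip(mol_lps, moe_lps)]
    return enhanced_lps, irm, branches


rng = random.Random(0)
mol, moe = build_mole(rng)
frame = {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name, dim in DIMS.items()}
enhanced_lps, mol_irm, branches = mole_forward(mol, moe, frame)
```

The sketch uses a 1-frame input context. A 4-frame or 7-frame context would, under the same assumptions, concatenate neighbouring frames into `x` before the first layer.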

Noise Modeling to Build Training Sets for Robust Speech Enhancement

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

Multi-scale Generative Adversarial Networks for Speech Enhancement

Study of GANs for Noisy Speech Simulation from Clean Speech

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation

On Generating Mixing Noise Signals with Basis Functions for Simulating Noisy Speech and Learning DNN-Based Speech Enhancement Models

Speech Enhancement Based on Noise Classification and Deep Neural Network

Noise-aware Speech Enhancement using Diffusion Probabilistic Model

Dynamic Noise Aware Training for Speech Enhancement Based on Deep Neural Networks

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

A Novel Training Strategy Using Dynamic Data Generation for Deep Neural Network Based Speech Enhancement

Feature-Matching Speech Denoising GANs via Progressive Training

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

Joint Ideal Ratio Mask and Generative Adversarial Networks for Monaural Speech Enhancement

Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

SEGAN: Speech Enhancement Generative Adversarial Network

Dynamic Noise Embedding: Noise Aware Training and Adaptation for Speech Enhancement