Abstract: In this study, we propose a novel deep neural network (DNN) architecture for speech enhancement (SE) via a multiobjective learning and ensembling (MOLE) framework to achieve a compact and low-latency design while maintaining good performance in quality evaluations. MOLE follows the boosting concept of combining weak models into a strong classifier and consists of two compact DNNs. The first, called the multiobjective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multiobjective set that includes clean speech features, dynamic noise features, and the ideal ratio mask (IRM). The second, called the multiobjective ensembling DNN (MOE-DNN), takes the learned features from the MOL-DNN as inputs and separately predicts clean LPS and IRM, clean MFCC and IRM, and clean GFCC and IRM using three sets of weak regression functions. Finally, a postprocessing operation can be applied to the estimated clean features by leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech corrupted by 15 noise types not seen in model training, the SE results show that the MOLE approach, which features a small model size and low run-time latency, achieves consistent improvements over both DNN- and long short-term memory (LSTM)-based techniques on all the objective metrics evaluated in this study, in all three cases (input contexts of 1, 4, and 7 frames). The 1-frame MOLE-based SE system outperforms the DNN-based SE system that uses a 7-frame input expansion with a 3-frame delay. It also outperforms the LSTM-based SE system that uses a 4-frame, no-delay expansion including only 3 previous frames, and does so with 170 times less processing latency.
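
The abstract names the ideal ratio mask (IRM) and log-power spectra (LPS) as learning targets but does not spell out their formulas. As a minimal sketch, assuming the commonly used square-root definition of the IRM and a natural-log LPS, the relationship between the two can be illustrated as follows. The function names and the `eps` stabilizer are hypothetical and are not taken from the paper; this is not the authors' postprocessing step.

```python
import numpy as np

def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """Assumed IRM definition: sqrt(S / (S + N)) per time-frequency bin."""
    clean_power = np.asarray(clean_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return np.sqrt(clean_power / (clean_power + noise_power + eps))

def log_power_spectrum(power, eps=1e-12):
    """Assumed LPS definition: natural log of the power spectrum."""
    return np.log(np.asarray(power, dtype=float) + eps)

def apply_mask_to_lps(noisy_lps, mask, eps=1e-12):
    """Scaling the magnitude by M scales the power by M^2,
    which adds 2 * log(M) in the log-power domain."""
    return noisy_lps + 2.0 * np.log(np.asarray(mask, dtype=float) + eps)

# Toy example: one frame with two frequency bins.
clean = np.array([[1.0, 3.0]])
noise = np.array([[1.0, 1.0]])
noisy = clean + noise  # assumes speech and noise powers add (uncorrelated)

irm = ideal_ratio_mask(clean, noise)
enhanced_lps = apply_mask_to_lps(log_power_spectrum(noisy), irm)
recovered_power = np.exp(enhanced_lps)
```

Under the additive-power assumption in this toy example, masking the noisy LPS with the oracle IRM recovers the clean power spectrum. In practice a network only estimates the mask, so the recovery is approximate.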

A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement

A Cross-Entropy-Guided (CEG) Measure for Speech Enhancement Front-End Assessing Performances of Back-End Automatic Speech Recognition

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Acoustics-guided Evaluation (AGE): a New Measure for Estimating Performance of Speech Enhancement Algorithms for Robust ASR

Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement

Optimizing Audio-Visual Speech Enhancement Using Multi-Level Distortion Measures for Audio-Visual Speech Recognition

Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Using Generalized Gaussian Distributions to Improve Regression Error Modeling for Deep Learning-Based Speech Enhancement

MAMGAN: Multiscale attention metric GAN for monaural speech enhancement in the time domain

Compensation of Speech Enhancement Distortion for Robust Speech Recognition

An Iterative Mask Estimation Approach to Deep Learning Based Multi-Channel Speech Recognition

A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement

DNN Training Based on Classic Gain Function for Single-channel Speech Enhancement and Recognition

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Using MCE Algorithm to Improve the Performance of Speaker Recognition

Speech Enhancement Based on Deep Mixture of Distinguishing Experts

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

Integrating Lattice-Free MMI into End-to-End Speech Recognition

LSTM-Based Iterative Mask Estimation and Post-Processing for Multi-Channel Speech Enhancement

CMGAN: Conformer-based Metric GAN for Speech Enhancement

CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction