Abstract: In this study, we propose a novel deep neural network (DNN) architecture for speech enhancement (SE) via a multiobjective learning and ensembling (MOLE) framework to achieve a compact and low-latency design while maintaining good performance in quality evaluations. MOLE follows the boosting concept of combining weak models into a strong classifier and consists of two compact DNNs. The first, called the multiobjective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multiobjective set that includes clean speech features, dynamic noise features, and the ideal ratio mask (IRM). The second, called the multiobjective ensembling DNN (MOE-DNN), takes the learned features from the MOL-DNN as inputs and separately predicts clean LPS and IRM, clean MFCC and IRM, and clean GFCC and IRM using three sets of weak regression functions. Finally, a postprocessing operation can be applied to the estimated clean features by leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech corrupted by 15 noise types not seen in model training, the SE results show that the MOLE approach, which features a small model size and low run-time latency, achieves consistent improvements over both DNN- and long short-term memory (LSTM)-based techniques in terms of all the objective metrics evaluated in this study for all three cases (input contexts of 1, 4, and 7 frames). The 1-frame MOLE-based SE system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame delay. It also achieves better performance than the LSTM-based SE system with a 4-frame, no-delay expansion that includes only the 3 previous frames, and does so with 170 times less processing latency.
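The input-context settings and the IRM target mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `splice_context` and `ideal_ratio_mask`, the edge-padding choice, and the square-root form of the mask are assumptions made for illustration. The sketch also reads "7-frame expansion at a 3-frame delay" as 3 past frames, the current frame, and 3 future frames.

```python
import numpy as np


def splice_context(frames, n_past, n_future):
    """Stack each frame with its neighbours along the feature axis.

    frames: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T, (n_past + 1 + n_future) * D).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    num_frames = frames.shape[0]
    padded = np.pad(frames, ((n_past, n_future), (0, 0)), mode="edge")
    # Window k holds the frame at offset (k - n_past) relative to the current one.
    windows = [padded[k:k + num_frames] for k in range(n_past + 1 + n_future)]
    return np.concatenate(windows, axis=1)


def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """One common IRM definition (assumed here): sqrt(S / (S + N)) per
    time-frequency bin, where S and N are clean and noise power spectra."""
    return np.sqrt(clean_power / (clean_power + noise_power + eps))


# Hypothetical feature matrix: 10 frames, 3 features per frame.
features = np.arange(30, dtype=float).reshape(10, 3)

one_frame = splice_context(features, 0, 0)      # 1-frame input, no delay
four_causal = splice_context(features, 3, 0)    # 3 previous frames + current
seven_delayed = splice_context(features, 3, 3)  # needs 3 future frames

print(one_frame.shape, four_causal.shape, seven_delayed.shape)
```

Under this reading, the look-ahead `n_future` is what introduces algorithmic delay: the 7-frame input cannot be formed until 3 future frames have arrived, whereas the 1-frame and causal 4-frame inputs are available as soon as the current frame is.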

Monaural Speech Enhancement Using Deep Multi-Branch Residual Network with 1-D Causal Dilated Convolutions

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

Deep Residual-Dense Lattice Network for Speech Enhancement

Multichannel Speech Enhancement without Beamforming

D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition

Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

Multi-branch Learning for Noisy and Reverberant Monaural Speech Separation

Residual Convolutional CTC Networks for Automatic Speech Recognition

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

Speech enhancement using progressive learning-based convolutional recurrent neural network

Multichannel Speech Enhancement Based on Time-Frequency Masking Using Subband Long Short-Term Memory

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement