Abstract: In this study, we propose a novel deep neural network (DNN) architecture for speech enhancement (SE) via a multiobjective learning and ensembling (MOLE) framework to achieve a compact, low-latency design while maintaining good performance in quality evaluations. MOLE follows the boosting concept of combining weak models into a strong classifier and consists of two compact DNNs. The first, called the multiobjective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multiobjective set that includes clean speech features, dynamic noise features, and the ideal ratio mask (IRM). The second, called the multiobjective ensembling DNN (MOE-DNN), takes the learned features from the MOL-DNN as inputs and separately predicts clean LPS and IRM, clean MFCC and IRM, and clean GFCC and IRM using three sets of weak regression functions. Finally, a postprocessing operation can be applied to the estimated clean features by leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech corrupted by 15 noise types not seen in model training, the SE results show that the MOLE approach, which features a small model size and low run-time latency, achieves consistent improvements over both DNN- and long short-term memory (LSTM)-based techniques on all the objective metrics evaluated in this study, for all three input contexts (1, 4, and 7 frames). The 1-frame MOLE-based SE system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame delay. It also outperforms the LSTM-based SE system with a 4-frame, no-delay expansion that includes only the 3 previous frames, while incurring 170 times less processing latency.
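The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature dimensions, hidden size, single hidden layer, random weights, and all function names are hypothetical. Treating the MOL-DNN outputs as the "learned features", and postprocessing by averaging the direct LPS estimate with an IRM-masked estimate, are likewise assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame feature sizes; the abstract does not state them.
DIMS = {"lps": 257, "mfcc": 39, "gfcc": 64}
IRM_DIM = 257
HIDDEN = 128
IN_DIM = sum(DIMS.values())          # concatenated noisy LPS + MFCC + GFCC
LEARNED_DIM = 2 * IN_DIM + IRM_DIM   # clean + noise + IRM estimates


def dense(in_dim, out_dim):
    """Random (weights, bias) pair standing in for a trained layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01, np.zeros(out_dim)


def apply(layer, x, sigmoid=False):
    """Affine transform, optionally followed by a sigmoid (used for masks)."""
    w, b = layer
    y = x @ w + b
    return 1.0 / (1.0 + np.exp(-y)) if sigmoid else y


# Stage 1 (MOL-DNN): one shared hidden layer, three output groups.
mol_hidden = dense(IN_DIM, HIDDEN)
mol_clean, mol_noise = dense(HIDDEN, IN_DIM), dense(HIDDEN, IN_DIM)
mol_irm = dense(HIDDEN, IRM_DIM)

# Stage 2 (MOE-DNN): three weak regressors, each predicting (feature, IRM).
moe_hidden = dense(LEARNED_DIM, HIDDEN)
moe_heads = {name: (dense(HIDDEN, dim), dense(HIDDEN, IRM_DIM))
             for name, dim in DIMS.items()}


def mol_dnn(noisy):
    """Predict clean features, noise features, and IRM from noisy features."""
    h = np.maximum(apply(mol_hidden, noisy), 0.0)  # ReLU
    return np.concatenate(
        [apply(mol_clean, h), apply(mol_noise, h),
         apply(mol_irm, h, sigmoid=True)], axis=-1)


def moe_dnn(learned):
    """Return {feature name: (clean feature estimate, IRM estimate)}."""
    h = np.maximum(apply(moe_hidden, learned), 0.0)
    return {name: (apply(feat, h), apply(mask, h, sigmoid=True))
            for name, (feat, mask) in moe_heads.items()}


def postprocess(noisy_lps, heads):
    """Assumed fusion rule: average direct and IRM-masked LPS estimates."""
    irm = np.mean([mask for _, mask in heads.values()], axis=0)
    masked_lps = noisy_lps + 2.0 * np.log(irm + 1e-8)  # mask in log-power domain
    return 0.5 * (heads["lps"][0] + masked_lps)


noisy = rng.standard_normal((4, IN_DIM))   # 4 frames of stand-in features
heads = moe_dnn(mol_dnn(noisy))
enhanced_lps = postprocess(noisy[:, :DIMS["lps"]], heads)
print(enhanced_lps.shape)  # (4, 257)
```

The point of the sketch is the structure: one compact network produces several related targets at once, and a second compact network refines them through several weak regressors whose outputs are fused afterwards. A trained system would replace the random weights and would likely use deeper layers and a different fusion rule.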

Lite-RTSE: Exploring a Cost-Effective Lite DNN Model for Real-Time Speech Enhancement in RTC Scenarios

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

CheapNET: Improving Light-weight speech enhancement network by projected loss function

Towards efficient models for real-time deep noise suppression

Ultra-Low Latency Speech Enhancement - A Comprehensive Study

Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

On real-time multi-stage speech enhancement systems

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization

Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement

Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure

Multichannel Speech Enhancement Based on Time-Frequency Masking Using Subband Long Short-Term Memory