Abstract: In this study, we propose a novel deep neural network (DNN) architecture for speech enhancement (SE) via a multiobjective learning and ensembling (MOLE) framework to achieve a compact and low-latency design while maintaining good performance in quality evaluations. MOLE follows the boosting concept of combining weak models into a strong classifier and consists of two compact DNNs. The first, called the multiobjective learning DNN (MOL-DNN), takes multiple features, such as log-power spectra (LPS), mel-frequency cepstral coefficients (MFCCs), and Gammatone frequency cepstral coefficients (GFCCs), to predict a multiobjective set that includes clean speech features, dynamic noise features, and the ideal ratio mask (IRM). The second, called the multiobjective ensembling DNN (MOE-DNN), takes the learned features from the MOL-DNN as inputs and separately predicts clean LPS and IRM, clean MFCC and IRM, and clean GFCC and IRM using three sets of weak regression functions. Finally, a postprocessing operation can be applied to the estimated clean features by leveraging the multiple targets learned from both the MOL-DNN and the MOE-DNN. On speech corrupted by 15 noise types not seen in model training, the SE results show that the MOLE approach, which features a small model size and low run-time latency, achieves consistent improvements over both DNN- and long short-term memory (LSTM)-based techniques in terms of all the objective metrics evaluated in this study, for all three cases (input contexts of 1, 4, and 7 frames). The 1-frame MOLE-based SE system outperforms the DNN-based SE system with a 7-frame input expansion at a 3-frame delay. It also achieves better performance than the LSTM-based SE system with a 4-frame, no-delay expansion that includes only the 3 previous frames, with 170 times less processing latency.
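The input contexts and the IRM target mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `expand_context` and `ideal_ratio_mask`, the edge-padding scheme, and the square-root form of the IRM (one commonly used definition) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def expand_context(feats, n_past, n_future):
    """Stack each frame with its neighbours along the feature axis.

    feats: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T, (n_past + 1 + n_future) * D).
    Edges are padded by repeating the first/last frame (an assumption).
    Looking n_future frames ahead implies a delay of n_future frames.
    """
    T, _ = feats.shape
    padded = np.concatenate(
        [
            np.repeat(feats[:1], n_past, axis=0),
            feats,
            np.repeat(feats[-1:], n_future, axis=0),
        ],
        axis=0,
    )
    # Window i holds, for every frame t, the frame at offset (i - n_past).
    windows = [padded[i:i + T] for i in range(n_past + 1 + n_future)]
    return np.concatenate(windows, axis=1)


def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """Square-root IRM per time-frequency bin (assumed definition)."""
    return np.sqrt(clean_power / (clean_power + noise_power + eps))


# Hypothetical toy input: 5 frames, 2 features per frame.
feats = np.arange(10, dtype=float).reshape(5, 2)
one_frame = expand_context(feats, 0, 0)    # 1-frame input, no delay
four_frame = expand_context(feats, 3, 0)   # 3 previous frames, no delay
seven_frame = expand_context(feats, 3, 3)  # 3 past + 3 future, 3-frame delay
```

In this sketch the 1-frame case leaves the features unchanged, the 4-frame case uses only past frames and so adds no look-ahead delay, and the 7-frame case needs 3 future frames before it can emit an output.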

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement

TEA-PSE: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2022 DNS Challenge

TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge

Personalized Speech Enhancement Without a Separate Speaker Embedding Model

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

High Fidelity Speech Enhancement with Band-split RNN

Noise Adaptive Speech Enhancement Using Domain Adversarial Training

TENET: A Time-reversal Enhancement Network for Noise-robust ASR

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

pDenoiser: A Personalized Speech Enhancement Neural Network for Pre-hospital Emergency Medical Services

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Array Configuration-Agnostic Personalized Speech Enhancement Using Long-Short-Term Spatial Coherence