Abstract: In multimedia intelligent systems, speech enhancement is commonly employed to improve the quality of speech signals, making them clearer and more natural. Current deep learning-based speech enhancement models typically treat noise as a single entity and aim to separate it from the target speech. In this paper, inspired by the cognitive behavior of the human brain when observing noisy speech spectrograms, we decompose the spectral energy of noise into regular and random components. We propose an auxiliary-model-based speech enhancement framework that better suppresses noise components closely resembling speech features. First, we introduce a voiceprint segmentation network (VSnet) that partitions noisy speech into voiceprint and non-voiceprint regions. Next, we present a noise reconstruction network (NRnet) that uses noise information from non-voiceprint regions to reconstruct and suppress the regular noise components within the voiceprint region. Finally, we combine a model dedicated to suppressing random components (RANnet) with a speech enhancement model (SEnet) and train them synchronously. Because the two models share encoder parameters, SEnet is compelled to extract fewer regular noise features from the original noisy speech, which improves the quality of the speech generated by the decoder. Experimental results on the public VoiceBank-DEMAND and DNS Challenge 2020 datasets demonstrate that our approach achieves state-of-the-art performance.
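The data flow of the first two stages (segment the spectrogram, then use non-voiceprint bins to estimate and remove regular noise inside the voiceprint region) can be illustrated with a toy sketch. This is not the authors' method: VSnet and NRnet are learned networks, whereas the sketch below takes the voiceprint mask as a given input and substitutes a simple per-band average for the noise reconstruction, assuming the regular noise is roughly stationary within each frequency band. The function name `suppress_regular_noise` and the list-of-lists spectrogram layout are hypothetical choices made for illustration.

```python
from statistics import mean


def suppress_regular_noise(spec, voiceprint_mask):
    """Toy stand-in for the VSnet -> NRnet data flow (not the paper's networks).

    spec: magnitude spectrogram as a list of frequency bands, each a list of
        magnitudes over time frames.
    voiceprint_mask: same shape; True marks bins assumed to lie in the
        voiceprint region (in the paper this would come from VSnet).
    """
    enhanced = []
    for band, mask_band in zip(spec, voiceprint_mask):
        # Collect the noise-only bins of this band (non-voiceprint region).
        noise_bins = [m for m, is_vp in zip(band, mask_band) if not is_vp]
        # Assumption: regular noise is stationary per band, so its level inside
        # the voiceprint region is approximated by the mean outside it.
        noise_floor = mean(noise_bins) if noise_bins else 0.0
        # Subtract the estimate inside the voiceprint region only; bins outside
        # the region are left unchanged in this sketch.
        enhanced.append([
            max(m - noise_floor, 0.0) if is_vp else m
            for m, is_vp in zip(band, mask_band)
        ])
    return enhanced


# Two frequency bands, three time frames; one voiceprint bin per band.
spec = [[1.0, 5.0, 1.0], [2.0, 2.0, 6.0]]
mask = [[False, True, False], [False, False, True]]
print(suppress_regular_noise(spec, mask))
```

In the framework described above, a learned reconstruction would replace the per-band average, and the remaining random components would be handled by the jointly trained RANnet and SEnet.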

Denoi-SpEx+: A Speaker Extraction Network based Speech Dialogue System

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

A Unified Speaker-Dependent Speech Separation and Enhancement System Based on Deep Neural Networks

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

Boosting the Performance of SpEx+ by Attention and Contextual Mechanism

X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion

Dialogue Topic Segmentation Via Parallel Extraction Network with Neighbor Smoothing

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Shared Network for Speech Enhancement Based on Multi-Task Learning

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

Conditional Diffusion Model for Target Speaker Extraction

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

pDenoiser: A Personalized Speech Enhancement Neural Network for Pre-hospital Emergency Medical Services

Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation