Abstract: In real-world scenarios, dynamic ambient noise often degrades speech quality, which motivates more capable speech enhancement techniques. Traditional methods, which rely on static embeddings as auxiliary features, struggle under varying noise conditions. To overcome this, we propose a Dual-stream Noise and Speech Information Perception (DNSIP) approach that dynamically detects and processes both noise and speech through information extraction and suppression mechanisms. The approach rests on the observation that non-speech segments predominantly contain environmental noise, while speech segments carry information about the intended speaker. To exploit this, real-time voice activity detection (VAD) is employed to differentiate between speech and non-speech segments. Building on the VAD estimates, we propose an information extraction framework that selectively extracts relevant noise and speech features from the noisy input, establishing a dual-stream network for concurrent noise and speech learning. To account for the temporal and spectral variability of noise and speech, a frequency-sequence attention mechanism is integrated, enhancing the model's ability to learn contextual and spectral dependencies. Additionally, an information suppression module is introduced to minimize cross-stream interference by attenuating noise within the speech stream and suppressing speech content within the noise stream. The derived noise and speech spectrograms are then used to formulate a minimum mean square error log-spectral amplitude (MMSE-LSA) estimator for robust speech enhancement. Experimental evaluations on the WSJ0 and VCTK+DEMAND datasets demonstrate that our DNSIP approach surpasses existing state-of-the-art methods, underscoring its efficacy in challenging acoustic environments.
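
The final step of the abstract, forming an MMSE-LSA estimator from the estimated noise and speech spectrograms, can be illustrated with the classical MMSE-LSA gain rule. The sketch below is not the authors' implementation. It assumes that the a priori SNR is the ratio of the estimated speech power to the estimated noise power, and that the a posteriori SNR is the ratio of the noisy power to the estimated noise power; how DNSIP actually derives these quantities is not stated in the abstract. The names `exp_int_e1`, `mmse_lsa_gain`, and `enhance_frame` are hypothetical helpers introduced only for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_int_e1(x, max_iter=200, eps=1e-12):
    """Exponential integral E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("E1 is only evaluated here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -math.log(x) - EULER_GAMMA
        term = 1.0
        for k in range(1, max_iter + 1):
            term *= -x / k              # term now equals (-x)^k / k!
            delta = -term / k
            total += delta
            if abs(delta) < eps * abs(total):
                break
        return total
    # Continued fraction evaluated with the modified Lentz method (x > 1).
    b = x + 1.0
    c = 1.0e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def mmse_lsa_gain(xi, gamma):
    """Classical MMSE-LSA gain for a priori SNR xi and a posteriori SNR gamma (linear scale)."""
    wiener = xi / (1.0 + xi)
    v = max(wiener * gamma, 1e-10)      # guard against E1(0), which diverges
    return wiener * math.exp(0.5 * exp_int_e1(v))


def enhance_frame(noisy_mag, speech_pow, noise_pow, eps=1e-12):
    """Apply the gain per frequency bin (assumed inputs: noisy magnitudes, estimated powers)."""
    enhanced = []
    for y, s, n in zip(noisy_mag, speech_pow, noise_pow):
        n = max(n, eps)
        xi = max(s / n, eps)            # assumption: a priori SNR from the two stream outputs
        gamma = max(y * y / n, eps)     # a posteriori SNR from the noisy observation
        enhanced.append(mmse_lsa_gain(xi, gamma) * y)
    return enhanced
```

In this form the gain never falls below the Wiener gain xi / (1 + xi) and approaches it as the a posteriori SNR grows, which is why the estimator tends to suppress noise less aggressively in low-SNR bins than a plain Wiener filter would.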

CompNet: Complementary Network for Single-Channel Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

A Feature Integration Network for Multi-Channel Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Shared Network for Speech Enhancement Based on Multi-Task Learning

FSI-Net: A dual-stage full- and sub-band integration network for full-band speech enhancement

Multi-Stage Progressive Speech Enhancement Network

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

Inter-SubNet: Speech Enhancement with Subband Interaction

Dual-stream Noise and Speech Information Perception based Speech Enhancement

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

Speech Enhancement Using U-Net with Compressed Sensing

A Hybrid Deep-Learning Approach for Single Channel HF-SSB Speech Enhancement

FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer

PCNN: A Lightweight Parallel Conformer Neural Network for Efficient Monaural Speech Enhancement

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

Decoupling-style monaural speech enhancement with a triple-branch cross-domain fusion network