Abstract: In real-world scenarios, dynamic ambient noise often degrades speech quality, which motivates speech enhancement techniques that adapt to changing noise conditions. Traditional methods that rely on static embeddings as auxiliary features struggle with such varying conditions. To overcome this, we propose a Dual-stream Noise and Speech Information Perception (DNSIP) approach that dynamically detects and processes both noise and speech through information extraction and suppression mechanisms. The approach builds on the observation that non-speech segments predominantly contain environmental noise, while speech segments carry information about the intended speaker. To exploit this, real-time voice activity detection (VAD) is employed to differentiate between speech and noise components. Building on the VAD estimates, we propose an information extraction framework that selectively extracts relevant noise and speech features from the noisy input, establishing a dual-stream network for concurrent noise and speech learning. To account for the temporal and spectral variability of noise and speech, a frequency-sequence attention mechanism is integrated, enhancing the model's ability to learn contextual and spectral dependencies. Additionally, an information suppression module is introduced to minimize cross-stream interference by attenuating noise within the speech stream and suppressing speech content within the noise stream. The derived noise and speech spectrograms are then used to formulate a minimum mean square error log-spectral amplitude (MMSE-LSA) estimator for robust speech enhancement. Experimental evaluations on the WSJ0 and VCTK+DEMAND datasets demonstrate that our DNSIP approach surpasses existing state-of-the-art methods, underscoring its efficacy in challenging acoustic environments.
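To make the last step of the abstract more concrete, the following is a minimal sketch of how a classical MMSE-LSA gain could be computed for one time-frequency bin from estimated speech and noise power. It uses the standard textbook form of the estimator, G = xi/(1+xi) * exp(0.5 * E1(v)) with v = xi*gamma/(1+xi), and is not taken from the paper: the function names (`exp1`, `lsa_gain`, `enhance_frame`), the choice of xi as the ratio of the two estimated power spectrograms, and the numerical floor are all assumptions made for illustration. How DNSIP actually derives the a priori and a posteriori SNRs from its two streams may differ.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def exp1(x: float) -> float:
    """Exponential integral E1(x) = integral from x to infinity of exp(-t)/t dt, x > 0."""
    if x <= 0.0:
        raise ValueError("exp1 is only defined here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, 40):
            term *= -x / k  # term is now (-x)^k / k!
            total -= term / k
        return total
    # Continued fraction evaluated with the modified Lentz method (x > 1).
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)


def lsa_gain(speech_power: float, noise_power: float, noisy_power: float,
             floor: float = 1e-10) -> float:
    """Textbook MMSE-LSA gain for one bin (hypothetical helper, not the paper's code)."""
    noise = max(noise_power, floor)
    xi = speech_power / noise       # a priori SNR (assumed: speech / noise estimate)
    gamma = noisy_power / noise     # a posteriori SNR
    v = max(xi / (1.0 + xi) * gamma, floor)
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))


def enhance_frame(noisy_mag, speech_power, noise_power):
    """Apply the per-bin gain to the noisy magnitudes of one frame."""
    return [
        lsa_gain(s, n, y * y) * y
        for y, s, n in zip(noisy_mag, speech_power, noise_power)
    ]
```

In this sketch a bin dominated by speech gets a gain close to 1, while a bin dominated by noise is strongly attenuated; for example, `enhance_frame([1.0, 31.6], [0.01, 1000.0], [1.0, 1.0])` leaves the second bin almost untouched and scales the first one down heavily.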

Speech enhancement via two-stage dual tree complex wavelet packet transform with a speech presence probability estimator

Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform

Noise Estimation Using Mean Square Cross Prediction Error for Speech Enhancement

Incorporation of a modified temporal cepstrum smoothing in both signal-to-noise ratio and speech presence probability estimation for speech enhancement

Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

An Improved Speech Enhancement Algorithm Based on Wavelet Transform

Speech Enhancement Based on Short-Time Spectral Amplitude Estimates in Low SNR

A generalized time-frequency subtraction method for robust speech enhancement based on wavelet filter banks modeling of human auditory system

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Dual-Stage Low-Complexity Reconfigurable Speech Enhancement

Improvement on Automatic Speech Segmentation Using Wavelet Packet Transform Features

Speech Enhancement for Non-Stationary Noise Environments

Dual-stream Noise and Speech Information Perception based Speech Enhancement

Noise reduction using wavelet thresholding of multitaper estimators and geometric approach to spectral subtraction for speech coding strategy

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement

Multi-task single channel speech enhancement using speech presence probability as a secondary task training target

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Modeling of Teager Energy Operated Perceptual Wavelet Packet Coefficients with an Erlang-2 PDF for Real Time Enhancement of Noisy Speech

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement