Abstract: In real-world scenarios, dynamic ambient noise often degrades speech quality, which calls for speech enhancement techniques that adapt to changing noise conditions. Traditional methods, which rely on static embeddings as auxiliary features, struggle to handle such variation. To overcome this, we propose a Dual-stream Noise and Speech Information Perception (DNSIP) approach that dynamically detects and processes both noise and speech through information extraction and suppression mechanisms. The approach builds on the observation that non-speech segments predominantly contain environmental noise, while speech segments carry information about the intended speaker. Real-time voice activity detection (VAD) is therefore employed to differentiate between speech and noise components. Building on the VAD estimates, we propose an information extraction framework that selectively extracts relevant noise and speech features from the noisy input, establishing a dual-stream network for concurrent noise and speech learning. To account for the temporal and spectral variability of noise and speech, a frequency-sequence attention mechanism is integrated, enhancing the model's ability to learn contextual and spectral dependencies. Additionally, an information suppression module is introduced to minimize cross-stream interference by attenuating noise within the speech stream and suppressing speech content within the noise stream. The derived noise and speech spectrograms are then used to formulate a minimum mean square error log-spectral amplitude (MMSE-LSA) estimator for robust speech enhancement. Experimental evaluations on the WSJ0 and VCTK+DEMAND datasets demonstrate that our DNSIP approach surpasses existing state-of-the-art methods, underscoring its efficacy in challenging acoustic environments.
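
For orientation, the final step can be written out using the standard MMSE-LSA gain of Ephraim and Malah. The abstract does not say how the two stream outputs enter the estimator, so the SNR definitions below are an assumption: the a priori SNR is taken as the ratio of the estimated speech and noise power spectra, and the a posteriori SNR as the ratio of the noisy and estimated noise power spectra. The symbols Y (noisy spectrum), \hat{S} and \hat{N} (speech-stream and noise-stream outputs), and the indices k (frequency bin) and l (frame) are introduced here for illustration only.

```latex
% Assumed SNR estimates from the two streams (not stated in the abstract)
\hat{\xi}(k,l) = \frac{|\hat{S}(k,l)|^{2}}{|\hat{N}(k,l)|^{2}}, \qquad
\hat{\gamma}(k,l) = \frac{|Y(k,l)|^{2}}{|\hat{N}(k,l)|^{2}}

% Standard MMSE-LSA gain
v(k,l) = \frac{\hat{\xi}(k,l)}{1+\hat{\xi}(k,l)}\,\hat{\gamma}(k,l), \qquad
G_{\mathrm{LSA}}(k,l) = \frac{\hat{\xi}(k,l)}{1+\hat{\xi}(k,l)}
  \exp\!\left(\frac{1}{2}\int_{v(k,l)}^{\infty}\frac{e^{-t}}{t}\,dt\right)

% Enhanced spectral amplitude
|\hat{X}(k,l)| = G_{\mathrm{LSA}}(k,l)\,|Y(k,l)|
```

Under these assumptions, a bin where the noise stream dominates has a small a priori SNR and so a gain near zero, while a bin where the speech stream dominates has a gain approaching one.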

The NPU-Elevoc Personalized Speech Enhancement System for ICASSP2023 DNS Challenge

TEA-PSE: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System for ICASSP 2022 DNS Challenge

TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

Personalized Speech Enhancement Without a Separate Speaker Embedding Model

NPU Speaker Verification System for INTERSPEECH 2020 Far-Field Speaker Verification Challenge

ICASSP 2023 Deep Noise Suppression Challenge

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Two-stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge

NTU-NPU System for Voice Privacy 2024 Challenge

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

The NPU System for the 2020 Personalized Voice Trigger Challenge

pDenoiser: A Personalized Speech Enhancement Neural Network for Pre-hospital Emergency Medical Services

Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

Backend Ensemble for Speaker Verification and Spoofing Countermeasure

Dual-stream Noise and Speech Information Perception based Speech Enhancement

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

Dynamic noise aware training for speech enhancement based on deep neural networks

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

ICASSP 2022 Deep Noise Suppression Challenge