Abstract: Recent studies in neural network-based monaural speech separation (SS) have achieved remarkable success thanks to the increasing ability of long-sequence modeling. However, they degrade significantly under realistic noisy conditions, as the background noise can be mistaken for a speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a novel network that unifies speech enhancement and separation with gradient modulation to improve noise robustness. Specifically, we first build a unified network by combining speech enhancement (SE) and separation modules, with multi-task learning for optimization, where SE is supervised by the parallel clean mixture to reduce noise for downstream speech separation. Furthermore, to avoid suppressing valid speaker information when reducing noise, we propose a gradient modulation (GM) strategy to harmonize the SE and SS tasks from an optimization view. Experimental results show that our approach achieves state-of-the-art performance on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively. Our code is available at GitHub.
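The abstract reports results as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated estimate over the unprocessed mixture. As a minimal sketch of how that metric is commonly computed (the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are illustrative assumptions, not taken from the paper's code):

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean of each signal, as is conventional for SI-SNR.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to get the scaled target component.
    inner = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    s_target = [(inner / tgt_energy) * t for t in tgt]
    # Everything in the estimate that is not explained by the target is noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

For multi-speaker separation the value is typically averaged over speakers after resolving the speaker permutation, a step this sketch leaves out.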

What problem does this paper attempt to address?

This paper addresses the significant degradation of monaural speech separation (SS) performance under real-world noisy conditions. Specifically, background noise may be misidentified as a speaker's voice and thus interfere with the separated source signals. To alleviate this problem, the authors propose a network architecture that combines speech enhancement (SE) and separation modules and adopts a gradient modulation (GM) strategy to improve noise robustness.

### Core Problems of the Paper

1. **Degradation of Speech Separation Performance under Noisy Conditions**:
   - Existing neural network-based monaural speech separation methods perform well under ideal conditions, but their performance degrades significantly in real-world noisy environments. Background noise may be misidentified as a speaker's voice, leading to inaccurate separation results.
2. **Joint Optimization of Speech Enhancement and Separation**:
   - Speech enhancement can reduce noise, but it may also suppress valid speaker information, which hurts the downstream separation task. The challenge is to reduce noise without suppressing that information.

### Solutions

1. **Unified Network Architecture**:
   - The authors construct a unified network that combines the speech enhancement and separation modules. The speech enhancement module is supervised by the parallel clean mixture, so it reduces noise and provides a cleaner input for the downstream speech separation task.
2. **Multi-task Learning**:
   - The network is optimized with multi-task learning: the speech enhancement module is trained against the parallel clean mixture while the separation module is trained on the separation objective, so both supervision signals shape the whole system.
3. **Gradient Modulation Strategy**:
   - To avoid suppressing valid speaker information during speech enhancement, the authors propose a gradient modulation strategy. This strategy resolves the gradient conflict between the speech enhancement task and the speech separation task so that the two tasks are harmonized during optimization.

### Experimental Results

- The method achieves state-of-the-art performance on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively.

### Conclusion

- By combining the speech enhancement and separation modules and adopting the gradient modulation strategy, the proposed method improves speech separation performance in noisy environments while avoiding over-suppression of valid speaker information.
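The summary does not spell out the exact modulation rule. One common way to resolve a conflict between two task gradients is to project the auxiliary gradient so that it no longer opposes the main one. The sketch below illustrates that general idea under stated assumptions: the projection rule, the function names `modulate_se_gradient` and `combined_update`, and the `se_weight` parameter are all hypothetical and are not confirmed as the authors' implementation.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def modulate_se_gradient(g_se, g_ss, eps=1e-12):
    """Remove the part of the SE gradient that opposes the SS gradient.

    Assumption: a conflict is defined as a negative inner product, and it is
    resolved by projecting g_se onto the plane orthogonal to g_ss.
    """
    inner = dot(g_se, g_ss)
    if inner >= 0.0:
        # No conflict: the SE gradient is left unchanged.
        return list(g_se)
    coef = inner / (dot(g_ss, g_ss) + eps)
    return [a - coef * b for a, b in zip(g_se, g_ss)]


def combined_update(g_se, g_ss, se_weight=1.0):
    """Multi-task update direction: SS gradient plus the modulated SE gradient."""
    g_se_mod = modulate_se_gradient(g_se, g_ss)
    return [b + se_weight * a for a, b in zip(g_se_mod, g_ss)]


# Toy example: the SE gradient points partly against the SS gradient.
g_ss = [1.0, 0.0]
g_se = [-1.0, 1.0]
update = combined_update(g_se, g_ss)
```

In a real training loop the two vectors would be the flattened gradients of the SE loss and the SS loss with respect to the shared parameters. Here the modulated SE gradient keeps only the component that does not work against separation.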

Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Noise-Aware Speech Separation with Contrastive Learning

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

A Unified Speaker-Dependent Speech Separation and Enhancement System Based on Deep Neural Networks.

Noise-robust Speech Separation with Fast Generative Correction

Single-Channel Speech Enhancement Algorithm Based on ME-MGCRN in Low Signal-to-Noise Scenario

Glmsnet: single channel speech separation framework in noisy and reverberant environments

Two-stage Model and Optimal SI-SNR for Monaural Multi-Speaker Speech Separation in Noisy Environment

A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks.

Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

Shared Network for Speech Enhancement Based on Multi-Task Learning.

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments

Real-time Speech Enhancement and Separation with a Unified Deep Neural Network for Single/Dual Talker Scenarios

A Refining Underlying Information Framework for Monaural Speech Enhancement

Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network

Unsupervised Single-Channel Speech Separation Via Deep Neural Network for Different Gender Mixtures

End-to-end Networks for Supervised Single-channel Speech Separation

Optimal Scale-Invariant Signal-to-noise Ratio and Curriculum Learning for Monaural Multi-Speaker Speech Separation in Noisy Environment.