Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng
DOI: https://doi.org/10.48550/arXiv.2302.11131
2023-02-22
Abstract: Recent studies in neural network-based monaural speech separation (SS) have achieved remarkable success thanks to the increasing ability of long-sequence modeling. However, they degrade significantly under realistic noisy conditions, as the background noise can be mistaken for a speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a novel network that unifies speech enhancement and separation with gradient modulation to improve noise robustness. Specifically, we first build a unified network by combining speech enhancement (SE) and separation modules, with multi-task learning for optimization, where SE is supervised by the parallel clean mixture to reduce noise for downstream speech separation. Furthermore, to avoid suppressing valid speaker information when reducing noise, we propose a gradient modulation (GM) strategy to harmonize the SE and SS tasks from an optimization view. Experimental results show that our approach achieves state-of-the-art performance on the large-scale Libri2Mix- and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively. Our code is available at GitHub.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
The problem this paper addresses is the significant degradation of monaural speech separation (SS) performance under real-world noisy conditions. Specifically, background noise may be mistaken for a speaker's voice and thus interfere with the separated source signals. To alleviate this problem, the authors propose a network architecture that combines speech enhancement (SE) and separation modules and adopts a gradient modulation (GM) strategy to improve noise robustness.

### Core Problems of the Paper

1. **Degradation of Speech Separation Performance under Noisy Conditions**:
   - Existing neural-network-based monaural speech separation methods perform well under clean conditions, but their performance degrades significantly in real-world noisy environments. Background noise may be mistaken for a speaker's voice, leading to inaccurate separation results.
2. **Joint Optimization of Speech Enhancement and Separation**:
   - Speech enhancement can reduce noise, but it may also suppress valid speaker information and thereby hurt the downstream separation task. The challenge is to reduce noise without suppressing that information.

### Solutions

1. **Unified Network Architecture**:
   - The authors construct a unified network that combines the speech enhancement and separation modules. The speech enhancement module is supervised by the parallel clean mixture, so it reduces noise and provides a cleaner input for the downstream speech separation task.
2. **Multi-task Learning**:
   - The SE and SS modules are optimized jointly with multi-task learning, so the supervision from the parallel clean mixture is used while training the whole system.
3. **Gradient Modulation Strategy**:
   - To avoid suppressing valid speaker information during speech enhancement, the authors propose a gradient modulation strategy. It resolves the gradient conflict between the speech enhancement task and the speech separation task so that the two tasks are harmonized during optimization.

### Experimental Results

- The method achieves state-of-the-art performance on the large-scale Libri2Mix-noisy and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively.

### Conclusion

- By combining the speech enhancement and separation modules and adopting the gradient modulation strategy, the proposed method improves speech separation performance in noisy environments while avoiding over-suppression of valid speaker information.
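
The two technical ingredients named above, the SI-SNRi metric and the handling of conflicting SE and SS gradients, can be illustrated with a small sketch. This summary does not give the paper's exact modulation rule, so the projection used here is an assumption borrowed from generic gradient-surgery methods: when the SE gradient points against the SS gradient, its conflicting component is removed before the two are combined. The function names and the `se_weight` parameter are hypothetical, and gradients are shown as plain lists rather than network parameters.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    # Remove the mean of each signal so a constant offset does not matter.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to get the scaled reference.
    scale = dot(est, tgt) / (dot(tgt, tgt) + eps)
    s_target = [scale * t for t in tgt]
    e_noise = [e - s for e, s in zip(est, s_target)]
    ratio = (dot(s_target, s_target) + eps) / (dot(e_noise, e_noise) + eps)
    return 10 * math.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


def modulate_gradient(g_se, g_ss, eps=1e-12):
    """Assumed rule (not taken from the paper): if the SE gradient conflicts
    with the SS gradient (negative inner product), drop the component of the
    SE gradient that lies along the SS gradient."""
    inner = dot(g_se, g_ss)
    if inner >= 0:
        return list(g_se)  # no conflict, leave the SE gradient unchanged
    coef = inner / (dot(g_ss, g_ss) + eps)
    return [a - coef * b for a, b in zip(g_se, g_ss)]


def combined_gradient(g_se, g_ss, se_weight=0.5):
    """Multi-task update direction: SS gradient plus the weighted,
    modulated SE gradient. se_weight is a hypothetical loss weight."""
    g_se_mod = modulate_gradient(g_se, g_ss)
    return [b + se_weight * a for a, b in zip(g_se_mod, g_ss)]
```

In this sketch the SS task is treated as the primary objective: its gradient is never altered, and only the SE gradient is adjusted when the two disagree. A real implementation would apply the same idea to the per-parameter gradients of the shared layers.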