Improving Design of Input Condition Invariant Speech Enhancement

Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian
2024-02-16
Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses the problem of building a single general-purpose speech enhancement (SE) system that can handle arbitrary input conditions, in particular different audio durations, sampling frequencies, and microphone configurations in noisy and reverberant environments. The authors call such a system "input condition invariant SE". The recently proposed USES (Unconstrained Speech Enhancement and Separation) model showed promising performance, but its multi-channel performance degrades severely in real conditions. The paper therefore redesigns that model so that performance in simulated conditions remains competitive while the degradation in real conditions is substantially mitigated.

Specifically, the paper makes three changes:

1. **Redesign the channel-modeling module**: The authors identify that this module's generalization to unseen scenarios can be sub-optimal and redesign it.
2. **Introduce a two-stage training strategy**: This improves training efficiency.
3. **Propose two new dual-path time-frequency blocks**: These achieve better performance with fewer parameters and lower computational cost than the existing method.

Experiments on various public datasets validate the combined proposals, with significantly improved performance in real conditions. Taken together, the changes make the model more robust and better at generalizing across input conditions.