Improving Design of Input Condition Invariant Speech Enhancement

Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

2024-02-16

Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at

Audio and Speech Processing, Sound

What problem does this paper attempt to address?

This paper aims to build a general-purpose speech enhancement (SE) system that handles arbitrary input conditions, in particular different audio durations, sampling frequencies, and microphone configurations in noisy and reverberant environments. The paper calls such a system "input condition invariant SE". The recently proposed USES (Unconstrained Speech Enhancement and Separation) model showed promising performance, but its multi-channel performance degrades severely in real conditions. The paper therefore redesigns the key components of that model so that performance in simulated conditions remains competitive while the degradation in real conditions is substantially reduced. Specifically, it makes three changes:

1. **Redesign the channel-modeling module**: The paper identifies that this module can generalize sub-optimally to unseen scenarios and redesigns it.
2. **Introduce a two-stage training strategy**: This improves training efficiency.
3. **Propose two new dual-path time-frequency blocks**: These outperform the existing method with fewer parameters and lower computational cost.

With all changes combined, experiments on various public datasets validate the proposed model, with significantly improved performance in real conditions.
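To make the term "dual-path time-frequency block" more concrete, here is a minimal pure-Python sketch of the general dual-path idea: model each time frame along the frequency axis, then each frequency bin along the time axis, with a residual connection after each path. This is not the paper's architecture. The names `smooth` and `dual_path_block` are hypothetical, and the moving average is only a toy stand-in for a learned sequence model. The grid is assumed to be non-empty and rectangular.

```python
from typing import Callable, List

# grid[t][f] holds one feature value for time frame t and frequency bin f.
Grid = List[List[float]]
SeqModel = Callable[[List[float]], List[float]]


def smooth(seq: List[float]) -> List[float]:
    """Toy stand-in for a learned sequence model (assumption, not from the paper):
    a 3-point moving average with edge clamping. Works for any sequence length."""
    n = len(seq)
    return [
        (seq[max(i - 1, 0)] + seq[i] + seq[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def dual_path_block(grid: Grid, seq_model: SeqModel = smooth) -> Grid:
    """Apply a frequency path, then a time path, each with a residual connection."""
    # Frequency path: treat every time frame as a sequence over frequency bins.
    freq_out = [
        [x + y for x, y in zip(frame, seq_model(frame))] for frame in grid
    ]

    num_frames = len(freq_out)
    num_bins = len(freq_out[0])

    # Time path: treat every frequency bin as a sequence over time frames.
    columns = [[freq_out[t][f] for t in range(num_frames)] for f in range(num_bins)]
    columns = [[x + y for x, y in zip(col, seq_model(col))] for col in columns]

    # Transpose back to the original [time][frequency] layout.
    return [[columns[f][t] for f in range(num_bins)] for t in range(num_frames)]


# The same block accepts grids with different numbers of frames and bins,
# e.g. a short low-rate input and a longer high-rate input.
short_input = [[1.0] * 5 for _ in range(3)]   # 3 frames, 5 frequency bins
long_input = [[1.0] * 9 for _ in range(7)]    # 7 frames, 9 frequency bins
short_output = dual_path_block(short_input)
long_output = dual_path_block(long_input)
```

Because neither path fixes the sequence length, the block does not depend on the number of frames (audio duration) or the number of frequency bins (which grows with the sampling frequency when the frequency resolution is held fixed). That property is what makes dual-path designs a natural fit for input condition invariant SE.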

