Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information, we supplement multiple spectrograms in different frame lengths into the time-domain encoders. They extract stationary frequency information in both narrowband and wideband. We also adopt multiple decoder outputs, each of which computes its corresponding resolution frequency loss. Experimental results show that (1) it is more effective to fuse stationary frequency features than non-stationary features in the encoder, and (2) the multiple outputs consistent with the frequency loss improve performance. Experiments on the Voice-Bank dataset show that the proposed method obtained a 0.14 PESQ improvement.

What problem does this paper attempt to address?

The paper studies time-domain speech enhancement (SE) and proposes improvements that address two issues in the existing DEMUCS model:

1. **Non-stationary analysis frames**: The frequency-domain loss in DEMUCS uses Short-Time Fourier Transforms (STFTs) with several frame lengths. Some of them (such as 64 ms and 128 ms) are long enough that the signal within a frame is non-stationary, which conflicts with the short-term stationarity of speech.
2. **Mismatch between multi-resolution frequency-domain loss and a single output**: DEMUCS has to satisfy frequency-domain losses at several resolutions with a single output, which makes the network harder to train.

To address these issues, the paper proposes two main improvements:

### 1. Integrating frequency information in the encoder

The authors integrate multi-resolution, stationary frequency-domain information into the time-domain SE encoder layer by layer. Specifically, spectrograms with three window sizes (8 ms, 16 ms, and 32 ms) are used as input features. These spectrograms are designed as wideband and narrowband so that they capture different types of frequency information. Experimental results show that fusing these stationary frequency features in the encoder is more effective than fusing non-stationary ones.

### 2. Multiple decoders aligned with learning objectives

To alleviate the mismatch between the multi-resolution frequency-domain loss and the single output, the paper proposes using multiple time-domain decoders. Each decoder output computes the STFT loss at only one resolution, so that each network output is matched to a single learning objective. Experimental results indicate that this method can improve performance across all resolutions.
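To make the three window sizes from the encoder-side improvement concrete, the following is a minimal sketch of computing magnitude spectrograms at 8 ms, 16 ms, and 32 ms. It is an illustration, not the authors' implementation: the 16 kHz sample rate, the Hann window, the 75% overlap, and the function names `magnitude_spectrogram`, `multi_resolution_spectrograms`, and `frequency_resolution_hz` are all assumptions that do not come from the summary.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: the summary does not state the sample rate


def magnitude_spectrogram(signal, frame_ms, sample_rate=SAMPLE_RATE):
    """Magnitude STFT with a Hann window and 75% overlap (both are assumptions)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = frame_len // 4
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per frame, one column per frequency bin (frame_len // 2 + 1 bins).
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_spectrograms(signal, frame_lengths_ms=(8, 16, 32)):
    """One spectrogram per frame length, keyed by the frame length in ms."""
    return {ms: magnitude_spectrogram(signal, ms) for ms in frame_lengths_ms}


def frequency_resolution_hz(frame_ms, sample_rate=SAMPLE_RATE):
    """Spacing between adjacent frequency bins for a given frame length."""
    return sample_rate / int(sample_rate * frame_ms / 1000)


# Stand-in input: one second of noise instead of real noisy speech.
noisy = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
for ms, spec in multi_resolution_spectrograms(noisy).items():
    print(ms, "ms ->", spec.shape, "bin spacing", frequency_resolution_hz(ms), "Hz")
```

Shorter frames give coarser frequency bins but finer time resolution (a wideband spectrogram), while longer frames give finer frequency bins (a narrowband spectrogram). How these features are fused into each encoder layer is not described in the summary, so it is left out of the sketch.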
Combining the two improvements gives DEMUCS with multi-resolution encoders and decoders (DEMUCS-MRE-MRD). It outperforms the DEMUCS baseline on multiple evaluation metrics, including a PESQ improvement of 0.14 on the Voice-Bank dataset.
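The difference between scoring one output at every resolution and scoring each decoder output at its own resolution can be sketched as below. This is a hedged illustration only: the loss form (spectral convergence plus log-magnitude L1, a form commonly used for multi-resolution STFT losses), the frame lengths in `FRAME_LENS`, and all function names are assumptions, since the summary does not give the exact loss or the decoder resolutions.

```python
import numpy as np

# Assumption: 8/16/32 ms frames at an assumed 16 kHz sample rate, in samples.
FRAME_LENS = (128, 256, 512)


def stft_mag(x, frame_len):
    """Magnitude STFT with a Hann window and 75% overlap (assumed settings)."""
    hop = frame_len // 4
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def stft_loss(estimate, target, frame_len, eps=1e-8):
    """Spectral convergence plus log-magnitude L1 at a single resolution."""
    est, ref = stft_mag(estimate, frame_len), stft_mag(target, frame_len)
    spectral_convergence = np.linalg.norm(ref - est) / (np.linalg.norm(ref) + eps)
    log_magnitude = np.mean(np.abs(np.log(ref + eps) - np.log(est + eps)))
    return spectral_convergence + log_magnitude


def single_output_loss(estimate, target, frame_lens=FRAME_LENS):
    """Baseline pairing: one waveform is scored at every resolution."""
    return sum(stft_loss(estimate, target, n) for n in frame_lens) / len(frame_lens)


def multi_output_loss(estimates, target, frame_lens=FRAME_LENS):
    """Multi-decoder pairing: output k is scored only at resolution k."""
    assert len(estimates) == len(frame_lens), "one output per resolution"
    total = sum(stft_loss(e, target, n) for e, n in zip(estimates, frame_lens))
    return total / len(frame_lens)


# Stand-in signals: random noise in place of clean speech and decoder outputs.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
outputs = [clean + 0.1 * rng.standard_normal(4000) for _ in FRAME_LENS]
print("multi-output loss:", multi_output_loss(outputs, clean))
```

When all decoder outputs are the same waveform, `multi_output_loss` reduces to `single_output_loss`, so the multi-decoder setup only differs once the outputs are allowed to specialize per resolution.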

Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

Efficient Encoder-Decoder and Dual-Path Conformer for Comprehensive Feature Learning in Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Time-Domain Multi-modal Bone/air Conducted Speech Enhancement

Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Multi-layer encoder-decoder time-domain single channel speech separation

Multi-stage Progressive Learning-Based Speech Enhancement Using Time–Frequency Attentive Squeezed Temporal Convolutional Networks

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Multi-Objective Learning and Mask-Based Post-Processing for Deep Neural Network Based Speech Enhancement

Multi-Loss Convolutional Network with Time-Frequency Attention for Speech Enhancement

Time domain speech enhancement with CNN and time-attention transformer

ForkNet: Simultaneous Time and Time-Frequency Domain Modeling for Speech Enhancement

A time-frequency fusion model for multi-channel speech enhancement

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

Speech Enhancement Using U-Net with Compressed Sensing

Speech Enhancement Based on Reducing the Detail Portion of Speech Spectrograms in Modulation Domain via Discrete Wavelet Transform