Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara
2023-03-26
Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information, we supplement multiple spectrograms in different frame lengths into the time-domain encoders. They extract stationary frequency information in both narrowband and wideband. We also adopt multiple decoder outputs, each of which computes its corresponding resolution frequency loss. Experimental results show that (1) it is more effective to fuse stationary frequency features than non-stationary features in the encoder, and (2) the multiple outputs consistent with the frequency loss improve performance. Experiments on the Voice-Bank dataset show that the proposed method obtained a 0.14 PESQ improvement.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper studies time-domain speech enhancement (SE) and proposes improvements that address two issues in the existing DEMUCS model:

1. **Issue with non-stationary signals**: The frequency-domain loss in DEMUCS uses Short-Time Fourier Transforms (STFTs) with several frame lengths. Some of these frames (such as 64 ms and 128 ms) are long enough that the signal inside them is non-stationary, which conflicts with the short-term stationarity of speech.
2. **Mismatch between multi-resolution frequency-domain loss and single output**: DEMUCS has to fit frequency-domain losses at several resolutions with a single output, which makes the network harder to train.

To address these issues, the paper proposes two main improvements:

### 1. Integrating frequency information in the encoder

The authors integrate multi-resolution stationary frequency-domain information into the time-domain SE encoder layer by layer. Specifically, spectrograms computed with three window sizes (8 ms, 16 ms, and 32 ms) are used as additional input features, so that the encoder sees both wideband and narrowband views of the signal. Experimental results show that fusing these stationary frequency features in the encoder is more effective than fusing non-stationary ones.

### 2. Multiple decoders aligned with learning objectives

To alleviate the mismatch between the multi-resolution frequency-domain loss and a single output, the paper uses multiple time-domain decoders. Each decoder output is used to compute the STFT loss at only one resolution, so each output is matched to a single learning objective. Experimental results indicate that outputs consistent with the frequency loss improve performance.
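The link between frame length and the wideband/narrowband distinction can be made concrete with a little arithmetic. The sketch below converts the frame lengths mentioned above into samples and frequency-bin spacing. It assumes a 16 kHz sampling rate and an FFT size equal to the frame length; neither is stated in this summary, and `frame_params` is a hypothetical helper introduced only for illustration.

```python
def frame_params(frame_ms, sample_rate=16000):
    """Return (samples per frame, one-sided FFT bins, bin spacing in Hz).

    Assumption: the FFT size equals the frame length (no zero padding),
    and the default 16 kHz sampling rate is illustrative only.
    """
    frame_len = round(frame_ms * sample_rate / 1000)
    n_bins = frame_len // 2 + 1          # bins from 0 Hz up to Nyquist
    bin_hz = sample_rate / frame_len     # spacing between adjacent bins
    return frame_len, n_bins, bin_hz


# 8/16/32 ms are the encoder window sizes named in the text;
# 64/128 ms are the longer frames described as non-stationary.
for ms in (8, 16, 32, 64, 128):
    frame_len, n_bins, bin_hz = frame_params(ms)
    print(f"{ms:>3} ms: {frame_len:>4} samples, {n_bins:>4} bins, {bin_hz} Hz per bin")
```

Shorter frames give coarser frequency bins but finer time resolution (a wideband spectrogram), while longer frames give finer frequency bins at the cost of averaging over more of the signal (a narrowband spectrogram).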
Combining the two improvements (DEMUCS with multi-resolution encoders and decoders, referred to as DEMUCS-MRE-MRD) outperforms the DEMUCS baseline on multiple evaluation metrics, including a 0.14 improvement in PESQ on the Voice-Bank dataset. These gains matter for building more stable and effective speech enhancement systems.
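The difference between scoring one output at every STFT resolution and pairing each decoder output with a single resolution can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the loss form (spectral convergence plus log-magnitude L1), the Hann window, the 25% hop, the `RESOLUTIONS` values, and all function names are assumptions, and any time-domain loss term is omitted.

```python
import numpy as np


def stft_mag(x, frame_len, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (no padding)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1))


def stft_loss(est, ref, frame_len, hop, eps=1e-8):
    """Assumed loss form: spectral convergence + log-magnitude L1."""
    est_mag = stft_mag(est, frame_len, hop)
    ref_mag = stft_mag(ref, frame_len, hop)
    sc = np.linalg.norm(ref_mag - est_mag) / (np.linalg.norm(ref_mag) + eps)
    log_l1 = np.mean(np.abs(np.log(ref_mag + eps) - np.log(est_mag + eps)))
    return sc + log_l1


def single_output_loss(est, ref, resolutions):
    """Baseline-style: one output is scored at every resolution."""
    return sum(stft_loss(est, ref, n, h) for n, h in resolutions)


def multi_output_loss(ests, ref, resolutions):
    """Each decoder output is scored only at its own resolution."""
    if len(ests) != len(resolutions):
        raise ValueError("need exactly one output per resolution")
    return sum(stft_loss(e, ref, n, h) for e, (n, h) in zip(ests, resolutions))


# Hypothetical (frame_len, hop) pairs: 8, 16, 32 ms frames at an assumed 16 kHz.
RESOLUTIONS = [(128, 32), (256, 64), (512, 128)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
# Stand-ins for three decoder outputs with different error levels.
outputs = [clean + s * rng.standard_normal(4000) for s in (0.05, 0.1, 0.2)]
print(multi_output_loss(outputs, clean, RESOLUTIONS))
```

In the single-output case the same waveform has to minimize every term at once; in the multi-output case each term constrains only its own decoder output, which is the alignment between outputs and objectives that the summary describes.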