MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
DOI: https://doi.org/10.21437/Interspeech.2023-1441
2023-05-23
Abstract: This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of a parallel magnitude mask decoder and phase decoder, which directly recover clean magnitude spectra and clean wrapped phase spectra by incorporating a learnable sigmoid activation and a parallel phase estimation architecture, respectively. Multi-level losses defined on magnitude spectra, phase spectra, short-time complex spectra, and time-domain waveforms are used to train the MP-SENet model jointly. Experimental results show that our proposed MP-SENet achieves a PESQ of 3.50 on the public VoiceBank+DEMAND dataset and outperforms existing advanced speech enhancement methods.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a central problem in speech enhancement (SE): how to denoise both the magnitude spectrum and the phase spectrum in the time-frequency domain. Existing SE methods fall into two categories: time-domain methods and time-frequency-domain methods. Time-domain methods generate clean waveforms directly, but they are inefficient on high-resolution waveforms and their output quality hits a bottleneck. Time-frequency-domain methods generally perform better, but most of them enhance only the magnitude spectrum and neglect the phase spectrum, which degrades the quality of the enhanced speech.
To solve this problem, the paper proposes MP-SENet, a model that denoises the magnitude spectrum and the phase spectrum in parallel in the time-frequency domain. MP-SENet adopts an encoder-decoder architecture in which the encoder and decoder are connected by two-stage convolution-augmented transformers (TS-Conformers) that capture both local and global information. The encoder encodes the input noisy magnitude and phase spectra into a compressed time-frequency representation. A magnitude mask decoder and a phase decoder then recover the clean magnitude spectrum and the clean phase spectrum from that representation in parallel. Experimental results show that MP-SENet outperforms existing advanced speech enhancement methods on the VoiceBank+DEMAND dataset, with especially large improvements in phase spectrum prediction.
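The parallel decoding step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the learnable sigmoid has the form `beta * sigmoid(alpha * x)` with a per-frequency slope `alpha`, that the mask multiplies the noisy magnitude, and that the phase decoder emits two parallel outputs combined with a two-argument arctangent to give a wrapped phase. None of these details are stated in this summary, and all function and parameter names are hypothetical.

```python
import numpy as np


def learnable_sigmoid(x, alpha, beta=2.0):
    """Bounded activation beta * sigmoid(alpha * x).

    Assumption: alpha is a trainable slope per frequency bin, and
    beta = 2.0 is an illustrative upper bound for the mask.
    """
    return beta / (1.0 + np.exp(-alpha * x))


def decode_magnitude(noisy_mag, mask_logits, alpha, beta=2.0):
    """Magnitude branch: predict a bounded mask and apply it to the noisy magnitude."""
    mask = learnable_sigmoid(mask_logits, alpha, beta)
    return noisy_mag * mask


def decode_phase(pseudo_real, pseudo_imag):
    """Phase branch: combine two parallel outputs into a wrapped phase in [-pi, pi]."""
    return np.arctan2(pseudo_imag, pseudo_real)


def to_complex(mag, phase):
    """Recombine magnitude and phase into a short-time complex spectrum."""
    return mag * np.exp(1j * phase)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_freq, n_frames = 201, 50  # illustrative spectrogram size

    # Stand-ins for the noisy magnitude and for the outputs of the two decoders.
    noisy_mag = np.abs(rng.standard_normal((n_freq, n_frames)))
    mask_logits = rng.standard_normal((n_freq, n_frames))
    alpha = np.ones((n_freq, 1))  # broadcasts across frames
    pseudo_real = rng.standard_normal((n_freq, n_frames))
    pseudo_imag = rng.standard_normal((n_freq, n_frames))

    enhanced_mag = decode_magnitude(noisy_mag, mask_logits, alpha)
    enhanced_phase = decode_phase(pseudo_real, pseudo_imag)
    enhanced_spec = to_complex(enhanced_mag, enhanced_phase)
    print(enhanced_spec.shape, enhanced_spec.dtype)
```

In a real system the mask logits and the two phase outputs would come from the decoder networks, and the complex spectrum would be passed through an inverse STFT to obtain the enhanced waveform. Random arrays stand in for those outputs here.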