MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

DOI: https://doi.org/10.21437/Interspeech.2023-1441

2023-05-23

Abstract: This paper proposes MP-SENet, a novel Speech Enhancement Network which directly denoises Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by convolution-augmented transformers. The encoder aims to encode time-frequency representations from the input noisy magnitude and phase spectra. The decoder is composed of parallel magnitude mask decoder and phase decoder, directly recovering clean magnitude spectra and clean-wrapped phase spectra by incorporating learnable sigmoid activation and parallel phase estimation architecture, respectively. Multi-level losses defined on magnitude spectra, phase spectra, short-time complex spectra, and time-domain waveforms are used to train the MP-SENet model jointly. Experimental results show that our proposed MP-SENet achieves a PESQ of 3.50 on the public VoiceBank+DEMAND dataset and outperforms existing advanced speech enhancement methods.
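The abstract names two output components, a learnable sigmoid activation for the magnitude mask and a parallel phase estimation architecture, without spelling out their form. The sketch below shows one plausible reading for a single time-frequency bin. The formula `beta / (1 + exp(-alpha * x))`, the default values of `alpha` and `beta`, the two-branch `atan2` formulation, and all function names are assumptions made for illustration and are not taken from the text above.

```python
import math


def learnable_sigmoid(x, alpha=1.0, beta=2.0):
    """Bounded activation beta / (1 + exp(-alpha * x)).

    Assumption: alpha stands in for a trainable slope and beta for a fixed
    upper bound on the mask; both defaults are illustrative only.
    """
    return beta / (1.0 + math.exp(-alpha * x))


def estimate_phase(pseudo_real, pseudo_imag):
    """Wrapped phase in [-pi, pi] from two parallel branch outputs.

    Assumption: the phase decoder ends in two parallel outputs that are
    combined with the two-argument arctangent.
    """
    return math.atan2(pseudo_imag, pseudo_real)


def decode_bin(noisy_mag, mask_logit, pseudo_real, pseudo_imag):
    """Return (enhanced magnitude, wrapped phase) for one time-frequency bin."""
    mask = learnable_sigmoid(mask_logit)
    return noisy_mag * mask, estimate_phase(pseudo_real, pseudo_imag)


if __name__ == "__main__":
    mag, phase = decode_bin(noisy_mag=0.5, mask_logit=0.0,
                            pseudo_real=-1.0, pseudo_imag=0.0)
    print(mag, phase)
```

Under these assumptions the mask can exceed 1, so a bin can be amplified as well as attenuated, and the `atan2` step keeps the predicted phase inside the wrapped range by construction.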

Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses a key issue in Speech Enhancement (SE): how to denoise both the magnitude spectrum and the phase spectrum effectively in the time-frequency domain. Existing SE methods fall into two categories: time-domain methods and time-frequency domain methods. Time-domain methods can generate clean waveforms directly, but they are inefficient on high-resolution waveforms and hit a quality bottleneck. Time-frequency domain methods perform better, but they usually enhance only the magnitude spectrum and neglect the phase spectrum, which degrades the quality of the enhanced speech.

To solve this problem, the paper proposes a new model, MP-SENet, which denoises the magnitude spectrum and the phase spectrum in parallel in the time-frequency domain. MP-SENet adopts an encoder-decoder architecture in which the encoder and decoder are connected by two-stage convolution-augmented transformers (TS-Conformers) that capture both local and global information. The encoder encodes the input noisy magnitude and phase spectra into compressed time-frequency representations. A magnitude mask decoder and a phase decoder, running in parallel, then recover the clean magnitude spectrum and the clean phase spectrum, respectively.

Experimental results show that MP-SENet outperforms existing advanced speech enhancement methods on the VoiceBank+DEMAND dataset, with especially significant improvements in phase spectrum prediction.
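To illustrate why neglecting the phase spectrum limits quality, the following minimal sketch rebuilds one complex time-frequency bin from a magnitude and a phase and compares two cases: a perfect magnitude paired with the unprocessed noisy phase, and a perfect magnitude paired with a perfect phase. The helper names and the numeric values are hypothetical and chosen only to make the effect visible; they are not results from the paper.

```python
import cmath
import math


def to_complex(magnitude, phase):
    """Combine a magnitude and a phase into one complex spectral bin."""
    return cmath.rect(magnitude, phase)


def spectrum_error(clean_mag, clean_phase, est_mag, est_phase):
    """Distance between the clean bin and the estimated bin in the complex plane."""
    return abs(to_complex(clean_mag, clean_phase) - to_complex(est_mag, est_phase))


# Hypothetical bin: the clean speech has magnitude 1.0 and phase 0.0,
# while noise has rotated the observed phase to pi / 2.
clean_mag, clean_phase = 1.0, 0.0
noisy_phase = math.pi / 2

# Magnitude-only enhancement: perfect magnitude, noisy phase reused.
magnitude_only_error = spectrum_error(clean_mag, clean_phase, clean_mag, noisy_phase)

# Parallel enhancement: both magnitude and phase recovered.
parallel_error = spectrum_error(clean_mag, clean_phase, clean_mag, clean_phase)

if __name__ == "__main__":
    print(magnitude_only_error, parallel_error)
```

In this toy case the magnitude-only estimate is still a distance of about 1.414 (the square root of 2) from the clean bin even though its magnitude is exact, whereas the estimate with both quantities recovered has zero error.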

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Time Domain Speech Enhancement Using Self-Attention-Based Subspace Projection

Shared Network for Speech Enhancement Based on Multi-Task Learning

LiSenNet: Lightweight Sub-band and Dual-Path Modeling for Real-Time Speech Enhancement

A speech enhancement model based on noise component decomposition: Inspired by human cognitive behavior

PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement

Speech Enhancement Using U-Net with Compressed Sensing

Magnitude-and-phase-aware Speech Enhancement with Parallel Sequence Modeling

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

A Neural Denoising Vocoder for Clean Waveform Generation from Noisy Mel-Spectrogram based on Amplitude and Phase Predictions

A Multiobjective Learning and Ensembling Approach to High-Performance Speech Enhancement with Compact Neural Network Architectures

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

Guided Speech Enhancement Network

Multi-Stage Progressive Speech Enhancement Network

TENET: A Time-reversal Enhancement Network for Noise-robust ASR

CompNet: Complementary Network for Single-Channel Speech Enhancement

End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization