Abstract: Phase information has a significant impact on the perceptual quality and intelligibility of speech. However, existing speech enhancement methods are limited in explicit phase estimation by the non-structural nature and wrapping characteristics of the phase, which creates a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. MP-SENet uses a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra through a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between magnitude and phase through explicit phase estimation, improving the perceptual quality of the enhanced speech.
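The wrapping issue mentioned in the abstract can be illustrated with a small sketch. It is not the MP-SENet implementation: the helper names (`split_magnitude_phase`, `recombine`, `anti_wrap`) are hypothetical, and the anti-wrapping distance shown here is only one plausible way to compare wrapped phases, assumed for illustration because the abstract does not state the form of the phase loss.

```python
import cmath
import math


def split_magnitude_phase(z):
    """Split one complex spectral bin into magnitude and wrapped phase.

    The phase returned by cmath.phase lies in [-pi, pi].
    """
    return abs(z), cmath.phase(z)


def recombine(magnitude, phase):
    """Rebuild the complex bin from a magnitude and a phase."""
    return cmath.rect(magnitude, phase)


def anti_wrap(delta):
    """Map a phase difference to its shortest angular distance in [0, pi].

    Assumption: an illustrative distance, not taken from the abstract.
    """
    return abs(delta - 2.0 * math.pi * round(delta / (2.0 * math.pi)))


# A bin just above the negative real axis: its phase is close to +pi.
z = complex(-1.0, 0.001)
mag, phase = split_magnitude_phase(z)
restored = recombine(mag, phase)

# Two phases that are almost the same angle but sit on opposite
# sides of the wrapping boundary at +/- pi.
a = math.pi - 0.01
b = -math.pi + 0.01
naive_error = abs(a - b)          # close to 2*pi, misleadingly large
wrapped_error = anti_wrap(a - b)  # close to 0.02, the true angular gap

print(f"naive error:   {naive_error:.4f}")
print(f"wrapped error: {wrapped_error:.4f}")
```

The naive difference treats two nearly identical angles as maximally far apart, which is one reason a loss defined directly on raw wrapped-phase values is hard to optimize.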

Combine Waveform and Spectral Methods for Single-channel Speech Enhancement

PhaseDCN: A Phase-Enhanced Dual-Path Dilated Convolutional Network for Single-Channel Speech Enhancement

A Speech Enhancement Method Based on Dual-Path Phase-Aware GAN Networks

Single-channel speech enhancement using improved progressive deep neural network and masking-based harmonic regeneration

Multichannel Speech Enhancement without Beamforming

PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network

Speech Enhancement with Phase Correction based on Modified DNN Architecture

Phase-Aware Speech Enhancement Based on Deep Neural Networks

Improving Speech Enhancement with Phonetic Embedding Features

A Joint Framework of Denoising Autoencoder and Generative Vocoder for Monaural Speech Enhancement

Phase Unwrapping Based Speech Enhancement

Speech Enhancement Using the Combination of Adaptive Wavelet Threshold and Spectral Subtraction Based on Wavelet Packet Decomposition

Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

Magnitude-and-phase-aware Speech Enhancement with Parallel Sequence Modeling

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Distant-talking Speech Recognition Based on Multi-objective Learning Using Phase and Magnitude-based Feature

Mask Estimation Incorporating Phase-Sensitive Information for Speech Enhancement

Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement

Waveform-domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Speech Perception Improvement Algorithm Based on a Dual-Path Long Short-Term Memory Network