Abstract: The decoupling-style concept is gaining traction in speech enhancement. It decouples the original complex spectrum estimation task into multiple easier sub-tasks (i.e., magnitude-only recovery and residual complex spectrum estimation), resulting in better performance and easier interpretability. In this paper, we propose a dual-branch federative magnitude and phase estimation framework, dubbed DBT-Net, for monaural speech enhancement, which aims to recover the coarse- and fine-grained regions of the overall spectrum in parallel. The two branches are complementary: the magnitude estimation branch is designed to filter out dominant noise components in the magnitude domain, while the complex spectrum purification branch is designed to inpaint the missing spectral details and implicitly estimate the phase information in the complex-valued spectral domain. To facilitate the information flow between the branches, interaction modules are introduced that leverage features learned in one branch to suppress the undesired parts and recover the missing components of the other branch. Instead of adopting conventional RNNs or temporal convolutional networks for sequence modeling, we employ a novel attention-in-attention transformer-based network within each branch for better feature learning. More specifically, it is composed of several adaptive spectro-temporal attention transformer-based modules and an adaptive hierarchical attention module, which aim to capture long-term time-frequency dependencies and to further aggregate intermediate hierarchical contextual information. Comprehensive evaluations on the WSJ0-SI84 + DNS-Challenge and VoiceBank + DEMAND datasets demonstrate that the proposed approach consistently outperforms previous advanced systems and yields state-of-the-art performance in terms of speech quality and intelligibility.
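The abstract does not state how the outputs of the two branches are combined. As a minimal sketch of the decoupling idea, the snippet below assumes that the magnitude branch produces a real-valued gain that is applied to the noisy magnitude and coupled with the noisy phase, and that the complex branch produces a residual spectrum that is simply added on top. The function name `fuse_branches` and its parameters are hypothetical, and the additive fusion rule is an assumption made for illustration, not a detail taken from the text above.

```python
import numpy as np


def fuse_branches(noisy_spec, mag_gain, residual):
    """Combine a coarse magnitude estimate with a complex residual.

    All three arguments are arrays of shape (freq_bins, frames).
    noisy_spec: complex STFT of the noisy speech.
    mag_gain:   real, non-negative gain (stand-in for the magnitude branch).
    residual:   complex correction (stand-in for the complex branch).
    The additive fusion used here is an illustrative assumption.
    """
    # Coarse estimate: filtered magnitude coupled with the noisy phase.
    coarse_mag = mag_gain * np.abs(noisy_spec)
    noisy_phase = np.angle(noisy_spec)
    coarse_spec = coarse_mag * np.exp(1j * noisy_phase)
    # Fine estimate: the complex residual restores missing detail and,
    # because it is complex-valued, implicitly adjusts the phase.
    return coarse_spec + residual
```

With a zero residual the result keeps the noisy phase and only the magnitude changes. Any non-zero complex residual can shift the phase as well, which is one way to read the abstract's statement that the phase is estimated implicitly.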

Improving Monaural Speech Enhancement with Dynamic Scene Perception Module

Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement

Single-channel speech enhancement using improved progressive deep neural network and masking-based harmonic regeneration

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Dual-stream Noise and Speech Information Perception based Speech Enhancement

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Multi-scale Informative Perceptual Network for Monaural Speech Enhancement

Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Monaural Speech Enhancement with Deep Residual-Dense Lattice Network and Attention Mechanism in the Time Domain

A Speech Enhancement Algorithm Using Computational Auditory Scene Analysis with Spectral Subtraction

A Refining Underlying Information Framework for Monaural Speech Enhancement

Incorporation of a modified temporal cepstrum smoothing in both signal-to-noise ratio and speech presence probability estimation for speech enhancement

A Time Domain Progressive Learning Approach with SNR Constriction for Single-Channel Speech Enhancement and Recognition

A Dual Microphone Speech Enhancement Method With A Smoothing Parameter Mask

DBT-Net: Dual-branch Federative Magnitude and Phase Estimation with Attention-in-attention Transformer for Monaural Speech Enhancement

Exploring Conventional Enhancement and Separation Methods for Multi‐speech Enhancement in Indoor Environments

pDenoiser: A Personalized Speech Enhancement Neural Network for Pre-hospital Emergency Medical Services

Speech Enhancement Based on Analysis–Synthesis Framework with Improved Parameter Domain Enhancement