Abstract: The decoupling-style concept is gaining traction in speech enhancement. It decouples the original complex spectrum estimation task into multiple easier sub-tasks (i.e., magnitude-only recovery and residual complex spectrum estimation), resulting in better performance and easier interpretability. In this paper, we propose a dual-branch federative magnitude and phase estimation framework, dubbed DBT-Net, for monaural speech enhancement, which aims to recover the coarse- and fine-grained regions of the overall spectrum in parallel. The two branches are complementary: the magnitude estimation branch is designed to filter out dominant noise components in the magnitude domain, while the complex spectrum purification branch is designed to inpaint the missing spectral details and implicitly estimate the phase information in the complex-valued spectral domain. To facilitate the information flow between the branches, interaction modules are introduced that leverage features learned from one branch to suppress the undesired parts and recover the missing components of the other branch. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel attention-in-attention transformer-based network within each branch for better feature learning. More specifically, it is composed of several adaptive spectro-temporal attention transformer-based modules and an adaptive hierarchical attention module, which aim to capture long-term time-frequency dependencies and to further aggregate intermediate hierarchical contextual information. Comprehensive evaluations on the WSJ0-SI84 + DNS-Challenge and VoiceBank + DEMAND datasets demonstrate that the proposed approach consistently outperforms previous advanced systems and yields state-of-the-art performance in terms of speech quality and intelligibility.
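The decoupling idea in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions the abstract does not spell out: that the magnitude branch produces a real-valued gain applied to the noisy magnitude, that this coarse magnitude is coupled with the noisy phase, and that the complex branch contributes an additive complex residual. The function name `fuse_branches` and its arguments are hypothetical, and random arrays stand in for the network outputs.

```python
import numpy as np

def fuse_branches(noisy_spec, mag_gain, residual):
    """Combine a magnitude-branch output with a complex-branch residual.

    noisy_spec: complex STFT of the noisy signal, shape (freq, time)
    mag_gain:   real, non-negative gain (stand-in for the magnitude branch)
    residual:   complex correction (stand-in for the complex branch)
    All three are illustrative placeholders, not the paper's actual outputs.
    """
    # Coarse estimate: filtered magnitude coupled with the unaltered noisy phase.
    coarse = mag_gain * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))
    # Fine estimate: the additive residual restores detail and, because it is
    # complex-valued, implicitly changes the phase as well.
    return coarse + residual

rng = np.random.default_rng(0)
shape = (4, 3)  # (frequency bins, frames), kept tiny for illustration
noisy = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
gain = rng.uniform(0.1, 1.0, size=shape)
residual = 0.05 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

coarse_only = fuse_branches(noisy, gain, np.zeros(shape, dtype=complex))
enhanced = fuse_branches(noisy, gain, residual)
```

With a zero residual the output keeps the noisy phase exactly and only the magnitude is scaled; a non-zero residual is the only path by which the phase can change in this sketch.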

A Novel Target Decoupling Framework Based on Waveform-Spectrum Fusion Network for Monaural Speech Enhancement

Decoupling-style monaural speech enhancement with a triple-branch cross-domain fusion network

DMF-Net: A decoupling-style multi-band fusion model for real-time full-band speech enhancement

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

PFRNet: Dual-Branch Progressive Fusion Rectification Network for Monaural Speech Enhancement

DBT-Net: Dual-branch Federative Magnitude and Phase Estimation with Attention-in-attention Transformer for Monaural Speech Enhancement

CompNet: Complementary Network for Single-Channel Speech Enhancement

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Joint Time-Frequency and Time Domain Learning for Speech Enhancement

FSI-Net: A dual-stage full- and sub-band integration network for full-band speech enhancement

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

A Hybrid Deep-Learning Approach for Single Channel HF-SSB Speech Enhancement

Channel and Temporal-Frequency Attention UNet for Monaural Speech Enhancement

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

ForkNet: Simultaneous Time and Time-Frequency Domain Modeling for Speech Enhancement

A time-frequency fusion model for multi-channel speech enhancement

Two Heads Are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement

DBNet: A Dual-branch Network Architecture Processing on Spectrum and Waveform for Single-channel Speech Enhancement

Multi-Scale Feature Fusion Transformer Network for End-to-End Single Channel Speech Separation