Abstract: The decoupling-style concept is gaining traction in the speech enhancement area. It decouples the original complex spectrum estimation task into multiple easier sub-tasks (i.e., magnitude-only recovery and residual complex spectrum estimation), resulting in better performance and easier interpretability. In this paper, we propose a dual-branch federative magnitude and phase estimation framework, dubbed DBT-Net, for monaural speech enhancement, which aims to recover the coarse- and fine-grained regions of the overall spectrum in parallel. The two branches are complementary: the magnitude estimation branch is designed to filter out dominant noise components in the magnitude domain, while the complex spectrum purification branch is designed to inpaint the missing spectral details and implicitly estimate the phase information in the complex-valued spectral domain. To facilitate the information flow between the two branches, interaction modules are introduced to leverage features learned from one branch, so as to suppress the undesired parts and recover the missing components of the other branch. Instead of adopting conventional RNNs and temporal convolutional networks for sequence modeling, we employ a novel attention-in-attention transformer-based network within each branch for better feature learning. More specifically, it is composed of several adaptive spectro-temporal attention transformer-based modules and an adaptive hierarchical attention module, aiming to capture long-term time-frequency dependencies and further aggregate intermediate hierarchical contextual information. Comprehensive evaluations on the WSJ0-SI84 + DNS-Challenge and VoiceBank + DEMAND datasets demonstrate that the proposed approach consistently outperforms previous advanced systems and yields state-of-the-art performance in terms of speech quality and intelligibility.
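The decoupling idea can be illustrated with a toy example. The sketch below is not the DBT-Net implementation. It assumes that the coarse estimate pairs a magnitude gain with the noisy phase and that the complex residual is simply added on top; the abstract states only that the two branches run in parallel and complement each other. The function name `fuse_branches`, the oracle gain, and the oracle residual are hypothetical stand-ins for the outputs of the two learned branches.

```python
import cmath


def fuse_branches(noisy_spec, mag_gain, residual):
    """Combine a coarse magnitude estimate with a complex residual.

    noisy_spec: complex spectrum bins of the noisy signal
    mag_gain:   real, non-negative gains (stand-in for the magnitude branch)
    residual:   complex corrections (stand-in for the complex branch)

    Assumption: the two outputs are summed bin by bin.
    """
    fused = []
    for x, g, r in zip(noisy_spec, mag_gain, residual):
        # Coarse estimate: scaled magnitude, phase copied from the noisy bin.
        coarse = cmath.rect(g * abs(x), cmath.phase(x))
        # Fine estimate: add the complex residual, which can also fix phase.
        fused.append(coarse + r)
    return fused


# Toy spectra with three bins (values chosen arbitrarily for illustration).
clean = [1.0 + 0.5j, -0.3 + 0.8j, 0.2 - 0.1j]
noise = [0.4 - 0.2j, 0.1 + 0.3j, -0.5 + 0.6j]
noisy = [s + n for s, n in zip(clean, noise)]

# Oracle gain: recovers the clean magnitude but keeps the noisy phase.
oracle_gain = [abs(s) / abs(x) for s, x in zip(clean, noisy)]
coarse_only = fuse_branches(noisy, oracle_gain, [0j] * len(noisy))

# Oracle residual: whatever the coarse estimate still misses.
oracle_residual = [s - c for s, c in zip(clean, coarse_only)]
restored = fuse_branches(noisy, oracle_gain, oracle_residual)
```

With oracle inputs, `coarse_only` matches the clean magnitude while retaining the noisy phase, and `restored` matches the clean spectrum. In this toy setting, that is the gap the residual complex estimate is meant to close.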

DBT-Net: Dual-branch Federative Magnitude and Phase Estimation with Attention-in-attention Transformer for Monaural Speech Enhancement
